CN111539196A - Text duplicate checking method and device, text management system and electronic equipment - Google Patents

Info

Publication number
CN111539196A
Authority
CN
China
Prior art keywords
word
text
checked
vector
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010297125.0A
Other languages
Chinese (zh)
Inventor
孟庆典
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Beijing BOE Optoelectronics Technology Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Beijing BOE Optoelectronics Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd, Beijing BOE Optoelectronics Technology Co Ltd filed Critical BOE Technology Group Co Ltd
Priority to CN202010297125.0A priority Critical patent/CN111539196A/en
Publication of CN111539196A publication Critical patent/CN111539196A/en

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present application provide a text duplicate checking method and device, a text management system, and electronic equipment. The text duplicate checking method comprises: acquiring several texts to be checked; dividing each text to be checked into words according to a word database, and determining the word frequency of each word in that text; determining, according to the words and their corresponding word frequencies, word vectors and the vector weight of each word vector in each text to be checked; and determining the repetition rate between any two texts to be checked according to the vector weights. The method divides the text into words by means of the word database, determines which words of the text appear in the word database and how often each occurs, computes vector weights from the word vectors obtained from the words and their word frequencies, and then compares the vector weights to determine the repetition rate of any two texts, so that any two texts can be compared efficiently and accurately.

Description

Method, device, text management system and electronic device for text duplication checking

Technical Field

The present application relates to the field of computer technology, and in particular to a text duplication checking method and device, a text management system, and an electronic device.

Background

Bidding is a relatively frequent activity in the operation and management of an enterprise. To select the partner that best serves the interests of all parties, the bidding materials provided by the bidders must be read and reviewed, and the optimal partner chosen on that basis. However, one tenderer usually faces many bidders, receives a large number of bid documents, and each bid document is long, so it is difficult to find a suitable bidder in a short time. Reading and screening bid documents is tedious work that requires considerable manpower and material resources. Because the bid documents all address the same project, their content is often quite similar. Comparing the bid documents manually to establish how similar they are, and selecting business partners on that basis, would be time-consuming and labor-intensive. Text duplicate checking techniques are therefore commonly used.

In the prior art, segmentation and semantic analysis techniques are usually combined with the longest string matching principle to calculate the character repetition rate of documents. However, semantic analysis requires high-performance hardware to support semantic recognition of the bid documents and often consumes excessive computing resources; the technology is also relatively closed, which limits how widely it can be applied. Methods based on the longest string matching principle, in turn, often miss many keywords, so the accuracy of duplicate checking is not high enough.

Summary of the Invention

In view of the shortcomings of the existing approaches, the present application proposes a text duplication checking method and device, a text management system, and an electronic device, to solve the technical problem that text duplication checking in the prior art is insufficiently efficient or accurate.

In a first aspect, an embodiment of the present application provides a method for text duplication checking, including:

acquiring several texts to be checked;

dividing each text to be checked into words according to a word database, and determining the word frequency of each word in the text to be checked;

determining, according to the words and their corresponding word frequencies, word vectors and the vector weight of each word vector in each text to be checked;

determining, according to the vector weights, the repetition rate between any two texts to be checked.

In some implementations of the first aspect, dividing each text to be checked into words according to the word database and determining the word frequency of each word in the text to be checked includes:

dividing the text to be checked into several words according to a preset semantic strategy;

if a word exists in the word database, determining that the word is a first-class word;

if a word does not exist in the word database, determining that the word is a second-class word;

determining the word frequency of each first-class word in the text to be checked.

In combination with the first aspect and the above implementations, in some implementations of the first aspect, determining the word frequency of each word in the text to be checked includes:

determining the total number of words in the text to be checked and the number of occurrences of each word;

determining the word frequency of each word in the text to be checked according to the total number of words and the number of occurrences of each word.

In combination with the first aspect and the above implementations, in some implementations of the first aspect, determining the word vectors and the vector weight of each word vector in each text to be checked according to the words and their corresponding word frequencies includes:

constructing a text vector space from all the words of the texts to be checked;

determining the word vector of each word according to each word of the text to be checked and its corresponding word frequency;

determining the weight of each word vector in the text vector space according to the text vector space and the word vector of each word.
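The three steps above can be illustrated with a minimal sketch. The specification does not state at this point how the vector weight is computed, so the sketch assumes that the text vector space is the sorted vocabulary of all texts and that a word's weight is its relative frequency in one text. The function names `build_vector_space` and `vector_weights` are hypothetical.

```python
from collections import Counter

def build_vector_space(docs_words):
    """Return the sorted vocabulary over all texts (assumed form of the 'text vector space')."""
    return sorted({w for words in docs_words for w in words})

def vector_weights(words, vocab):
    """Project one text onto the vocabulary.

    Assumption: the weight of a word is its relative frequency in the text.
    """
    counts = Counter(words)
    total = len(words)
    return [counts[w] / total if total else 0.0 for w in vocab]

# Two already-segmented texts (illustrative data only).
docs = [["bid", "price", "bid", "schedule"], ["bid", "warranty"]]
vocab = build_vector_space(docs)
weights = [vector_weights(d, vocab) for d in docs]
```

Because every text is projected onto the same vocabulary, the weights of any two texts can be compared position by position.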

In combination with the first aspect and the above implementations, in some implementations of the first aspect, determining the similarity of any two texts to be checked according to the vector weights includes:

determining all the words that the two texts to be checked have in common;

determining the coincidence rate of each common word in the two texts according to the vector weights of that word in the two texts;

determining the repetition rate of the two texts to be checked according to the coincidence rates of all the common words.
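A rough sketch of this comparison follows. The passage does not give the formulas, so both are assumptions made only for illustration: the coincidence rate of a common word is taken as the ratio of its smaller weight to its larger weight, and the repetition rate is taken as the mean coincidence rate over the common words. The names `coincidence_rate` and `repetition_rate` are hypothetical.

```python
def coincidence_rate(w_a, w_b):
    """Per-word overlap: smaller weight divided by larger weight (assumed formula)."""
    larger = max(w_a, w_b)
    return min(w_a, w_b) / larger if larger > 0 else 0.0

def repetition_rate(weights_a, weights_b):
    """Mean coincidence rate over the words common to both texts (assumed aggregation)."""
    shared = weights_a.keys() & weights_b.keys()
    if not shared:
        return 0.0
    return sum(coincidence_rate(weights_a[w], weights_b[w]) for w in shared) / len(shared)

# Word -> vector weight for two texts (illustrative data only).
a = {"bid": 0.5, "price": 0.25, "schedule": 0.25}
b = {"bid": 0.25, "price": 0.25, "warranty": 0.5}
rate = repetition_rate(a, b)
```

Note that this aggregation ignores words that appear in only one text; an implementation might instead weight the result by how much of each text the common words cover.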

In a second aspect, the present application provides a text duplication checking device, including:

an acquisition module, configured to acquire several texts to be checked;

a word segmentation module, configured to divide each text to be checked into words according to a word database and to determine the word frequency of each word in the text to be checked;

a vector module, configured to determine, according to the words and their corresponding word frequencies, word vectors and the vector weight of each word vector in each text to be checked;

a duplicate checking module, configured to determine the repetition rate of any two texts to be checked according to the vector weights.

In a third aspect, the present application provides a text management method, including:

acquiring identity verification information and determining an identity verification result;

processing the text corresponding to the identity verification information according to the identity verification result;

checking the text for duplication according to the method for text duplication checking described in the first aspect of the present application.

In a fourth aspect, the present application provides a text management system, including:

an account management device, configured to acquire identity verification information and determine an identity verification result;

a file transmission device, configured to process the text corresponding to the identity verification information according to the identity verification result;

a text duplication checking device, configured to check the text for duplication according to the method for text duplication checking described in the first aspect of the present application.

In a fifth aspect, the present application provides an electronic device, including:

a processor;

a memory, electrically connected to the processor;

at least one program, stored in the memory and configured to be executed by the processor, the at least one program being configured to implement the method for text duplication checking described in the first aspect of the present application.

In a sixth aspect, the present application provides a computer-readable storage medium for storing computer instructions which, when run on a computer, implement the method for text duplication checking described in the first aspect of the present application.

The beneficial technical effects of the technical solutions provided in the embodiments of the present application are as follows:

The text duplication checking method provided by the present application divides the text into words by means of the word database, determines each word of the text that appears in the word database together with its word frequency, computes the vector weight of each word from the word vector obtained from the word and its word frequency, and then compares the vector weights to determine the similarity between two texts. Because the word division is accurate and efficient, the repetition rate of any two texts can be compared efficiently and more precisely.

Additional aspects and advantages of the present application will be set forth in part in the following description; they will become apparent from that description or be learned through practice of the present application.

Description of Drawings

The above and/or additional aspects and advantages of the present application will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a schematic flowchart of a text management method provided by an embodiment of the present application;

FIG. 2 is a schematic structural diagram of a text management system provided by an embodiment of the present application;

FIG. 3 is a schematic flowchart of a method for text duplication checking provided by an embodiment of the present application;

FIG. 4 is a schematic flowchart of a method, provided by an embodiment of the present application, for dividing each text to be checked into words according to a word database and determining the word frequency of each word in the text to be checked;

FIG. 5 is a schematic flowchart of a method, provided by an embodiment of the present application, for determining word vectors and the vector weight of each word vector in each text to be checked according to the words and their corresponding word frequencies;

FIG. 6 is a schematic flowchart of a method, provided by an embodiment of the present application, for determining the pairwise repetition rate of the texts to be checked according to the vector weights;

FIG. 7 is a schematic structural diagram of a text duplication checking device provided by an embodiment of the present application;

FIG. 8 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.

Detailed Description

The present application is described in detail below. Examples of its embodiments are illustrated in the accompanying drawings, in which the same or similar reference numerals throughout denote the same or similar components or components having the same or similar functions. Detailed descriptions of known technologies are omitted where they are not necessary for the features of the present application that are shown. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present application and are not to be construed as limiting it.

Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. Terms such as those defined in general-purpose dictionaries should be understood to have meanings consistent with their meanings in the context of the prior art and, unless specifically defined as they are here, should not be interpreted in an idealized or overly formal sense.

Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "said" and "the" used herein may also include the plural forms. The word "comprising" used in the specification of this application indicates the presence of the stated features, integers, steps, operations, elements and/or components, but does not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.

In their business operations, enterprises often need to cooperate with other enterprises on particular work, so bidding management is one of an enterprise's necessary activities. To quickly determine the repetition rate or similarity of a large number of bid documents, the text of the bid documents must be analyzed. However, the duplicate checking methods available in the prior art either need high-performance hardware to achieve sufficient accuracy and efficiency, or are not accurate enough to serve as a practical reference.

To solve the above problems, the present application provides a text duplication checking method and device, a text management system, and an electronic device.

First, at the overall level, an embodiment of the present application provides a text management method that can be used to manage bidding documents or texts. The flow of the text management method is shown in FIG. 1 and includes at least the following steps:

S100: Acquire identity verification information and determine an identity verification result.

S200: Process the text corresponding to the identity verification information according to the identity verification result.

S300: Check the text for duplication according to the method for text duplication checking described in this application.

The method for text duplication checking described in this application is introduced in detail later.
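The flow S100 to S300 can be sketched as follows. This is only an illustration under stated assumptions: the class `TextManager`, its in-memory account table, and the pluggable `check_fn` are hypothetical names that do not appear in the specification, and the plain-text password comparison is used purely to keep the example short.

```python
from dataclasses import dataclass, field

@dataclass
class TextManager:
    accounts: dict                               # user -> password (hypothetical, illustration only)
    store: dict = field(default_factory=dict)    # user -> uploaded text

    def authenticate(self, user, password):
        """S100: acquire the verification information and determine the result."""
        return user in self.accounts and self.accounts[user] == password

    def upload(self, user, password, text):
        """S200: process the text only if authentication succeeded."""
        if not self.authenticate(user, password):
            raise PermissionError("authentication failed")
        self.store[user] = text

    def check_duplicates(self, check_fn):
        """S300: apply a duplicate checking function to every pair of stored texts."""
        users = sorted(self.store)
        return {(a, b): check_fn(self.store[a], self.store[b])
                for i, a in enumerate(users) for b in users[i + 1:]}

manager = TextManager(accounts={"alice": "pw1", "bob": "pw2"})
manager.upload("alice", "pw1", "same text")
manager.upload("bob", "pw2", "same text")
result = manager.check_duplicates(lambda a, b: a == b)
```

Passing the checking function in as a parameter keeps the account and file handling independent of whichever duplicate checking method is used.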

Based on the same inventive concept, an embodiment of the present application correspondingly provides a text management system for implementing the above text management method. Specifically, as shown in FIG. 2, the text management system includes at least the following devices: an account management device, a file transmission device, and a text duplication checking device. The account management device is configured to acquire identity verification information and determine an identity verification result. The file transmission device is configured to process the text corresponding to the identity verification information according to the identity verification result. The text duplication checking device is configured to check the text for duplication according to the method for text duplication checking described in this application.

Specifically, the account management device is mainly responsible for the account management functions of the text management system provided by the embodiments of the present application, including account registration, account management, and account login. Account registration means collecting the information entered by a bidder and generating a bidder account once the information has been approved. Account management means entering the account information of successfully registered bidders into a database so that it can be managed. Account login means verifying whether the account currently logging in is a registered account; once verification succeeds, the bidder operation interface can be entered. The account management device can also distinguish tenderers and bidders, who have different system permissions.

Correspondingly, the file transmission device is responsible for the file or text transmission functions of the system, including text upload and text download. Text upload allows a bidder to upload text to a designated directory, and text download allows the tenderer or a bidder to download uploaded documents. That is, a bidder who has successfully logged in to the text management system through the account management device can upload its own bid document, or download bid documents and other files that correspond to the bidder's permissions.

The text duplication checking device is responsible for checking texts for duplication. It searches the relevant texts in the text management system, for example the texts uploaded by all bidders in a particular tender, and reports the result as the similarity of any two texts. The detailed duplicate checking method is described later.

In addition, the text management system provided by the embodiments of the present application further includes a text reading device. The text reading device is a document reader that uses a local text editing tool, such as the Word program: it starts a silent Word process to handle operations such as reading and editing text, so the system can process files in .doc and .docx formats. The text reading device can rely on the Word program already installed on the local machine to decode the text, which reduces the load on the system's own program.

The overall design of the text management system completes the design of each entity and its attributes and determines the attributes of the enterprise bidding database, which can guide the work of building that database. The text management system provided by the embodiments of the present application mainly includes three main functional devices in a modular design. The system as a whole is clearly conceived and simply structured, makes bidding activities convenient to manage, and ensures that the text analysis work of bidding activities is carried out efficiently.

The following describes in detail, by way of specific embodiments, the technical solution for text duplication checking of the present application and how it achieves efficient text duplication checking.

An embodiment of one aspect of the present application provides a method for text duplication checking. As shown in FIG. 3, the method includes at least the following steps:

S310: Acquire several texts to be checked.

S320: Divide each text to be checked into words according to a word database, and determine the word frequency of each word in the text to be checked.

S330: Determine, according to the words and their corresponding word frequencies, word vectors and the vector weight of each word vector in each text to be checked.

S340: Determine the repetition rate between any two texts to be checked according to the vector weights.

The text duplication checking method provided by the present application divides the text into words by means of the word database, determines each word of the text that appears in the word database together with its word frequency, computes the vector weight of each word from the word vector obtained from the word and its word frequency, and then compares the vector weights to determine the repetition rate between two texts. Because the word division is accurate and efficient, the repetition rate of any two texts can be compared efficiently and more precisely.
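Step S340 applies to any two texts, which means iterating over every unordered pair of the acquired texts. The sketch below shows only that pairing loop. The scoring function `shared_vocabulary_ratio` is a deliberately simple placeholder and is not the vector-weight comparison of the application; both function names and the sample data are hypothetical.

```python
from itertools import combinations

def pairwise_rates(texts, rate_fn):
    """Apply rate_fn to every unordered pair of texts, keyed by the pair of names."""
    return {(a, b): rate_fn(texts[a], texts[b])
            for a, b in combinations(sorted(texts), 2)}

def shared_vocabulary_ratio(words_a, words_b):
    """Placeholder score: distinct shared words / distinct words overall.

    This is NOT the formula of the application; it only makes the loop runnable.
    """
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

# Already-segmented texts (illustrative data only).
texts = {
    "bid_A": ["bid", "price", "schedule"],
    "bid_B": ["bid", "price", "warranty"],
    "bid_C": ["training", "support"],
}
rates = pairwise_rates(texts, shared_vocabulary_ratio)
```

For n texts this produces n(n-1)/2 comparisons, which is one reason the cost of each individual comparison matters when many bid documents are received.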

Optionally, in one implementation of the above embodiment of the present application, step S320, in which each text to be checked is divided into words according to the word database and the word frequency of each word in the text to be checked is determined, specifically includes the following, as shown in FIG. 4:

S311: Divide the text to be checked into several words according to a preset semantic strategy.

S312: If a word exists in the word database, determine that the word is a first-class word; if a word does not exist in the word database, determine that the word is a second-class word.

S313: Determine the word frequency of each first-class word in the text to be checked.
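The sub-steps above can be sketched as follows, under the assumption that the preset semantic strategy is a fixed word length, so the text is simply cut into consecutive chunks of that many characters. The helper names `segment_fixed` and `classify_and_count` and the tiny word database are hypothetical.

```python
from collections import Counter

def segment_fixed(text, size):
    """Assumed preset strategy: cut the text into consecutive chunks of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def classify_and_count(words, word_db):
    """Split words into first-class (in the database) and second-class (not in it).

    Both groups are counted here for inspection; the method itself only
    requires the frequencies of the first-class words.
    """
    first = Counter(w for w in words if w in word_db)
    second = Counter(w for w in words if w not in word_db)
    return first, second

word_db = {"我司", "投标"}                       # "our company", "bid" (illustrative entries)
words = segment_fixed("我司投标我司报价", 2)      # two-character strategy
first, second = classify_and_count(words, word_db)
```

Here "报价" ("quotation") is absent from the illustrative database and therefore falls into the second class.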

常见的分词技术有基于字的全文索引或者基于词的全文索引,其中基于词的全文索引由于提高了分词的准确性,抽取关键词作为索引项,所以查重的准确率大幅度提高。然而,由于汉语言的特殊性,可能存在单个字为一个词语,也可能存在三个字或三个字以上为一个词语等多种词语组成情况。因此预设语义策略则是根据实现确定的语义划分规则,进行文本扫描分词。有的预设语义策略是以单个字为一个词,有的则是以两个字为一个词,如此类推。Common word segmentation techniques include word-based full-text indexing or word-based full-text indexing. Word-based full-text indexing improves the accuracy of word segmentation and extracts keywords as index items, so the accuracy of duplicate checking is greatly improved. However, due to the particularity of the Chinese language, there may be situations where a single character is a word, or there may be three or more characters that constitute a word and other words. Therefore, the preset semantic strategy is to perform text scanning and word segmentation according to the semantic division rules determined by the implementation. Some preset semantic strategies use a single character as a word, while others use two characters as a word, and so on.

由于标书是针对某一个项目主体产生的文本文件,这种文本文件本身存在大量规范化的文字,以及各文本在采购方案或者说工程建设方案的专业性语句方面存在大量的重复用词的情况。因此本申请提供的文本查重方法采用词语数据库对采用预设语义策略划分出的词语进行校正,通过该方法提高分词效率,并进而提高文本查重效率和准确率。词语数据库是一种行业词典,这种词语数据库包括有企业在长期招投标工作中统计出的足够多的关键词,并且该词语数据库还能够不断得到更新迭代,也能够通过人工编辑的方式实现更新迭代。Since the bidding document is a text file generated for a certain project subject, there are a lot of standardized words in this text file, and there are a lot of repeated words in each text in the professional statement of the procurement plan or engineering construction plan. Therefore, the text duplication checking method provided by the present application uses the word database to correct the words divided by the preset semantic strategy, and the word segmentation efficiency is improved by this method, thereby improving the text duplication checking efficiency and accuracy. The word database is an industry dictionary. This word database includes enough keywords that have been counted by enterprises in the long-term bidding work, and the word database can be continuously updated and iterated, and can also be updated by manual editing. iterate.

上述S310在执行时,根据某一预设语义策略对待查重文本进行词语划分后,该待查重文本以众多零散的词语的形式存在,这些词语当中会存在大量相同的词语,并且有些词语并非是本次招投标活动中需要的词语,例如对于“我司”这一内容,通过预设语义策略,可能划分为“我”和“司”两个具有独立含义的词语,而在词语数据库中并不存在“我”和“司”这两个词语,因此将这两个词语划分为第二类词语。而显然,运行文本查重的方法的系统会更改预设语义策略,将“我司”作为一个单独的词语,而在词语数据库中存在“我司”这一词语,则“我司”为第一类词语,并统计这一词语在整个文本中出现的次数,也即词频。When the above S310 is executed, after the text to be repeated is divided into words according to a preset semantic strategy, the text to be repeated exists in the form of many scattered words, and there will be a large number of the same words among these words, and some words are not. It is the word needed in this bidding activity. For example, for the content of "our company", through the preset semantic strategy, it may be divided into two words with independent meanings, "me" and "company", and in the word database The words "I" and "si" do not exist, so these two words are divided into the second category of words. Obviously, the system running the method of text duplication will change the preset semantic strategy, and use "our company" as a separate word, and if the word "our company" exists in the word database, then "our company" is the first A class of words, and count the number of times this word appears in the entire text, that is, word frequency.

Dividing the text under different preset semantic strategies yields different sets of words and corresponding word frequencies. To improve segmentation efficiency, the types of semantic strategy underlying the entries of the word database to be used can be reviewed and counted, and the strategy that accounts for the largest share can be adopted as the preset semantic strategy, which avoids repeated trial and error over strategies.

Optionally, in a specific implementation of the above embodiment, the step in S320 of determining the word frequency of each word in the text to be checked includes: determining the total number of words in the text to be checked and the number of repetitions of each word; and determining the word frequency of each word in the text to be checked from the total number of words and the number of repetitions of each word.

Once the text to be checked has been divided into words and turned into a data file made up of those words, the total number of words it contains can be counted. Many of the words are repeated, so each word has a corresponding number of repetitions. To measure how often words occur more objectively, and to allow comparison between texts, the number of repetitions of a word can be divided by the total number of words to give a relative word frequency. This avoids the large differences in duplicate checking results that would otherwise arise from bidding documents of different lengths.
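The relative word frequency described above (repetitions divided by total word count) can be illustrated with a short sketch. The function name `relative_frequencies` and the sample word lists are assumptions made for this example.

```python
from collections import Counter


def relative_frequencies(words):
    """Return {word: repetitions / total number of words}."""
    total = len(words)
    if total == 0:
        return {}
    counts = Counter(words)
    return {w: c / total for w, c in counts.items()}


short_bid = ["bid", "price", "bid", "schedule"]
long_bid = short_bid * 10  # same wording, ten times the length

rf_short = relative_frequencies(short_bid)
rf_long = relative_frequencies(long_bid)
```

The raw counts of the two texts differ by a factor of ten, but their relative frequencies are identical, which is the length-independence the paragraph above is after.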

Optionally, in an implementation of the above embodiment of this application, S330, determining the word vectors and the vector weight of each word vector in each text to be checked according to the words and their word frequencies, is shown in FIG. 5 and includes:

S331: constructing a text vector space from all the words of the text to be checked.

S332: determining the word vector of each word from each word of the text to be checked and the word frequency of that word.

S333: determining the weight of each word vector in the text vector space from the text vector space and the word vector of each word.

To obtain the repetition rate between different texts, the words and their word frequencies need to be processed. Specifically, an SVM (Support Vector Machine) algorithm can be used to vectorize all the words segmented from the text to be checked. The SVM algorithm converts the text to be checked into a specific vector space in which each vector consists of a word and the word frequency of that word. Among the foregoing steps, S331 and S332 need not be performed in any particular order. The weight of each vector in the vector space is then calculated by a weighting method; a commonly used one is the TF-IDF (Term Frequency-Inverse Document Frequency) function.
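As a rough illustration of S331 and S332, the sketch below builds a shared vocabulary and one frequency vector per text. It is a plain bag-of-words construction written with the standard library, not a call into any SVM library, and the function name `build_vector_space` and the sample texts are assumptions.

```python
from collections import Counter


def build_vector_space(docs):
    """docs: list of word lists.
    Returns (vocabulary, vectors), where vectors[k][i] is the relative
    frequency of vocabulary[i] in docs[k]."""
    vocabulary = sorted({w for doc in docs for w in doc})
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append(
            [counts[w] / total if total else 0.0 for w in vocabulary]
        )
    return vocabulary, vectors


docs = [
    ["bid", "price", "bid", "schedule"],
    ["price", "schedule", "warranty", "schedule"],
]
vocabulary, tf_vectors = build_vector_space(docs)
```

Every text is represented over the same vocabulary, so a word that is absent from a text simply has frequency 0.0 in that text's vector.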

Optionally, in an implementation of the above embodiment of this application, S340, determining the repetition rate of any two texts to be checked according to the vector weights, is shown in FIG. 6 and includes:

S341: determining all the words that are the same in the two texts to be checked.

S342: determining the coincidence rate of each identical word in the two texts from its vector weights in the two texts.

S343: determining the repetition rate of the two texts to be checked from the coincidence rates of all the identical words.

Through the foregoing steps, once the system has determined the weight of every word vector in each text vector space, it can use the cosine algorithm to calculate the similarity between a given text and every other text. The detailed calculation may be as follows:

First, the SVM model converts the segmented text into a feature vector $V(d) = (t_1, \omega_1(d); \ldots; t_s, \omega_s(d))$, where $t_i$ ($i = 1, 2, \ldots, s$) is a sequence of mutually distinct terms, $s$ is a positive integer, and $\omega_i(d)$ is the weight of $t_i$ in the text vector space $d$. The weight is generally defined as a function of $tf_i(d)$, the frequency with which $t_i$ occurs in $d$: $\omega_i(d) = \psi(tf_i(d))$. Specifically, $\omega_i(d)$ is calculated by formula (1) and formula (2). The similarity of the two texts to be checked, that is, their degree of repetition or repetition rate, is then obtained from the cosine similarity formula (3).

[Equation image: Figure BDA0002452608640000101]

[Equation image: Figure BDA0002452608640000102]

In formula (1) and formula (2), $N$ is the total number of texts and $n_i$ is the number of texts that contain the term $t_i$.
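Formulas (1) and (2) are not reproduced in the text, so the sketch below does not claim to match them. It assumes one common textbook form of TF-IDF, weight = tf × log(N / n_i), with N and n_i as defined above; the application's formulas may differ, for example in smoothing or normalization. The function name `tfidf_weights` and the sample texts are likewise assumptions.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """docs: list of word lists.
    Returns one {term: weight} dict per text, using the assumed,
    unsmoothed variant weight = tf * log(N / n_i)."""
    n_texts = len(docs)                 # N: number of all texts
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc))       # n_i: texts containing term t_i
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            t: (c / total) * math.log(n_texts / doc_freq[t])
            for t, c in counts.items()
        })
    return weights


docs = [
    ["bid", "price", "bid", "schedule"],
    ["price", "schedule", "warranty", "schedule"],
    ["bid", "warranty", "penalty", "penalty"],
]
weights = tfidf_weights(docs)
```

With this variant a term that occurs in every text receives weight 0, so boilerplate wording shared by all bidding documents stops contributing to the comparison. Whether that is desirable depends on the use case, and smoothed variants avoid it.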

[Equation image: Figure BDA0002452608640000103]

In the cosine similarity formula (3), $d_i$ denotes the text vector space of the first text, $d_j$ denotes the text vector space of the second text, and $m$ is the number of terms in the text vector space.
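Formula (3) itself is not reproduced in the text. The sketch below implements the standard cosine similarity, sim(d_i, d_j) = Σ_k ω_k(d_i)·ω_k(d_j) / (√Σ_k ω_k(d_i)² · √Σ_k ω_k(d_j)²), which is assumed here to be what formula (3) expresses. The function name `cosine_similarity` and the sample weights are hypothetical, and treating the shared-word sum as the step behind S341 to S343 is one possible reading rather than something the text states.

```python
import math


def cosine_similarity(w_i, w_j):
    """w_i, w_j: {term: weight} dicts for two texts.
    Returns 0.0 if either weight vector is all zeros or empty."""
    shared = set(w_i) & set(w_j)        # only shared words add to the dot product
    dot = sum(w_i[t] * w_j[t] for t in shared)
    norm_i = math.sqrt(sum(v * v for v in w_i.values()))
    norm_j = math.sqrt(sum(v * v for v in w_j.values()))
    if norm_i == 0.0 or norm_j == 0.0:
        return 0.0
    return dot / (norm_i * norm_j)


a = {"bid": 1.0, "price": 2.0}
b = {"bid": 1.0}
c = {"warranty": 3.0}

sim_same = cosine_similarity(a, a)
sim_partial = cosine_similarity(a, b)
sim_disjoint = cosine_similarity(a, c)
```

Words missing from one text have an implicit weight of zero, so only the words the two texts share contribute to the numerator, while every word contributes to the norms.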

Based on the same inventive concept, an embodiment of another aspect of this application provides a text duplicate checking device 10. As shown in FIG. 7, the text duplicate checking device 10 includes an acquisition module 11, a word segmentation module 12, a vector module 13, and a duplicate checking module 14.

The acquisition module 11 is configured to acquire a plurality of texts to be checked.

The word segmentation module 12 is configured to divide each text to be checked into words according to the word database and to determine the word frequency of each word in the text to be checked.

The vector module 13 is configured to determine, according to the words and their word frequencies, the word vectors and the vector weight of each word vector in each text to be checked.

The duplicate checking module 14 is configured to determine the repetition rate of any two texts to be checked according to the vector weights.

The text duplicate checking device provided by this application divides the text into words using the word database, identifies each word in the text that appears in the word database together with its word frequency, calculates the vector weight of each word from the word vector obtained from the word and its word frequency, and then determines the repetition rate between two texts by comparing vector weights. This allows the repetition rate of any two documents to be compared efficiently and more accurately.
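To show how the four modules could fit together, here is a compact end-to-end sketch. It is only an illustration under several assumptions: the class name `TextDuplicateChecker` and its method names are hypothetical, the input texts are taken to be already divided into words, the weight is the unsmoothed tf × log(N / n_i) variant, and the repetition rate is taken to be the cosine similarity of the weight vectors.

```python
import math
from collections import Counter
from itertools import combinations


class TextDuplicateChecker:
    """Illustrative stand-in for device 10; methods mirror modules 11 to 14."""

    def __init__(self, word_database):
        self.word_database = set(word_database)

    def acquire(self, texts):
        # Acquisition module 11: texts arrive as pre-divided word lists.
        return [list(t) for t in texts]

    def segment(self, words):
        # Word segmentation module 12: keep words found in the database and
        # compute their frequency relative to the total word count.
        counts = Counter(w for w in words if w in self.word_database)
        total = len(words)
        return {w: c / total for w, c in counts.items()}

    def vectorize(self, freqs_per_text):
        # Vector module 13: weight = tf * log(N / n_i) (assumed variant).
        n_texts = len(freqs_per_text)
        doc_freq = Counter(w for freqs in freqs_per_text for w in freqs)
        return [
            {w: tf * math.log(n_texts / doc_freq[w]) for w, tf in freqs.items()}
            for freqs in freqs_per_text
        ]

    def check(self, weights):
        # Duplicate checking module 14: cosine similarity for every pair.
        rates = {}
        for i, j in combinations(range(len(weights)), 2):
            a, b = weights[i], weights[j]
            dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
            norm_a = math.sqrt(sum(v * v for v in a.values()))
            norm_b = math.sqrt(sum(v * v for v in b.values()))
            norm = norm_a * norm_b
            rates[(i, j)] = dot / norm if norm else 0.0
        return rates

    def run(self, texts):
        words = self.acquire(texts)
        freqs = [self.segment(w) for w in words]
        return self.check(self.vectorize(freqs))


checker = TextDuplicateChecker({"bid", "price", "schedule", "warranty", "penalty"})
texts = [
    ["bid", "price", "bid", "schedule", "the"],
    ["bid", "price", "bid", "schedule", "of"],
    ["warranty", "penalty", "penalty", "a"],
]
rates = checker.run(texts)
```

The first two texts share all their dictionary words in the same proportions and score close to 1, while the third shares none and scores 0. Words outside the dictionary ("the", "of", "a") are ignored by the comparison.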

Optionally, the word segmentation module 12 dividing each text to be checked into words according to the word database and determining the word frequency of each word in the text to be checked includes: dividing the text to be checked into a number of words according to a preset semantic strategy; if a word exists in the word database, determining that the word is a first-category word; if a word does not exist in the word database, determining that the word is a second-category word; and determining the word frequency of each first-category word in the text to be checked.

Optionally, the word segmentation module 12 determining the word frequency of each word in the text to be checked includes: determining the total number of words in the text to be checked and the number of repetitions of each word; and determining the word frequency of each word in the text to be checked from the total number of words and the number of repetitions of each word.

Optionally, the vector module 13 determining the word vectors and the vector weight of each word vector in each text to be checked according to the words and their word frequencies includes: constructing a text vector space from all the words of the text to be checked; determining the word vector of each word from each word of the text to be checked and the word frequency of that word; and determining the weight of each word vector in the text vector space from the text vector space and the word vector of each word.

Optionally, the duplicate checking module 14 determining the repetition rate of any two texts to be checked according to the vector weights includes: determining all the words that are the same in the two texts to be checked; determining the coincidence rate of each identical word in the two texts from its vector weights in the two texts; and determining the repetition rate of the two texts to be checked from the coincidence rates of all the identical words.

Based on the same inventive concept, an embodiment of this application provides an electronic device that includes a memory and a processor.

The memory is electrically connected to the processor.

At least one computer program is stored in the memory and, when executed by the processor, implements any of the text duplicate checking methods provided by the embodiments of this application, or any optional implementation of those methods.

Those skilled in the art will understand that the electronic device provided by the embodiments of this application may be specially designed and manufactured for the required purpose, or may comprise known devices in a general-purpose computer. Such devices store computer programs that are selectively activated or reconfigured. Such a computer program may be stored in a device-readable (for example, computer-readable) medium, or in any type of medium suitable for storing electronic instructions and coupled to a bus.

Compared with the prior art, the electronic device provided by this application can compare the repetition rate of any two documents efficiently and more accurately.

In an optional embodiment, this application provides an electronic device. As shown in FIG. 8, the electronic device 1000 includes a processor 1001 and a memory 1003, which are electrically connected, for example through a bus 1002.

The processor 1001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It can implement or execute the various exemplary logical blocks, modules, and circuits described in connection with the disclosure of this application. The processor 1001 may also be a combination that implements computing functions, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.

The bus 1002 may include a path that carries information between the above components. The bus 1002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 1002 may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is used in FIG. 8, but this does not mean that there is only one bus or one type of bus.

The memory 1003 may be a ROM (Read-Only Memory) or another type of static storage device capable of storing static information and instructions; a RAM (Random Access Memory) or another type of dynamic storage device capable of storing information and instructions; an EEPROM (Electrically Erasable Programmable Read-Only Memory); a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like); a magnetic disk storage medium or other magnetic storage device; or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, without being limited to these.

Optionally, the electronic device 1000 may further include a transceiver 1004. The transceiver 1004 may be used to receive and send signals, and may allow the electronic device 1000 to communicate wirelessly or by wire with other devices to exchange data. In practical applications, the number of transceivers 1004 is not limited to one.

Optionally, the electronic device 1000 may further include an input unit 1005. The input unit 1005 may be used to receive input numbers, characters, images, and/or sound information, or to generate key signal inputs related to user settings and function control of the electronic device 1000. The input unit 1005 may include, but is not limited to, one or more of a touch screen, a physical keyboard, function keys (such as volume control keys and power keys), a trackball, a mouse, a joystick, a camera, a sound pickup, and the like.

Optionally, the electronic device 1000 may further include an output unit 1006. The output unit 1006 may be used to output or display information processed by the processor 1001. The output unit 1006 may include, but is not limited to, one or more of a display device, a speaker, a vibration device, and the like.

Although FIG. 8 shows an electronic device 1000 with various components, it should be understood that not all of the illustrated components need to be implemented or present. More or fewer components may be implemented or present instead.

Optionally, the memory 1003 is used to store the application program code for executing the solution of this application, and execution is controlled by the processor 1001. The processor 1001 is configured to execute the application program code stored in the memory 1003 so as to implement any of the text duplicate checking methods provided by the embodiments of this application.

Based on the same inventive concept, an embodiment of this application provides a computer-readable storage medium on which a computer program is stored. When executed by a processor, the computer program implements any of the text duplicate checking methods shown in the embodiments of this application, or any optional implementation of those methods.

Compared with the prior art, the computer-readable storage medium provided by this application can, through the computer program stored on it, compare the repetition rate of any two documents efficiently and more accurately.

Those skilled in the art will understand that the steps, measures, and solutions in the various operations, methods, and processes discussed in this application may be alternated, modified, combined, or deleted. Further, other steps, measures, and solutions in the various operations, methods, and processes discussed in this application may also be alternated, modified, rearranged, decomposed, combined, or deleted. Further, steps, measures, and solutions in the prior art that correspond to the various operations, methods, and processes disclosed in this application may also be alternated, modified, rearranged, decomposed, combined, or deleted.

The terms "first" and "second" are used for descriptive purposes only and are not to be understood as indicating or implying relative importance or as implicitly indicating the number of the technical features referred to. A feature qualified by "first" or "second" may therefore explicitly or implicitly include one or more such features. In the description of this application, unless otherwise stated, "a plurality of" means two or more.

In the description of this application, unless otherwise expressly specified and limited, the terms "installed", "connected", and "coupled" are to be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be direct or indirect through an intermediate medium; and it may be an internal communication between two elements. Those of ordinary skill in the art can understand the specific meaning of these terms in this application according to the specific situation.

In the description of this specification, specific features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.

It should be understood that although the steps in the flowcharts of the drawings are shown in the sequence indicated by the arrows, they are not necessarily executed in that sequence. Unless explicitly stated herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages. These sub-steps or stages are not necessarily completed at the same time and may be executed at different times; nor are they necessarily executed one after another, and they may be executed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

The above are only some of the embodiments of this application. It should be noted that those of ordinary skill in the art can make a number of improvements and refinements without departing from the principles of this application, and such improvements and refinements should also be regarded as falling within the scope of protection of this application.

Claims (10)

1. A text duplicate checking method, characterized by comprising:
acquiring a plurality of texts to be checked;
dividing each text to be checked into words according to a word database, and determining the word frequency of each word in the text to be checked;
determining, according to the words and the word frequencies corresponding to the words, word vectors and the vector weight of each word vector in each text to be checked;
determining, according to the vector weights, the repetition rate between any two of the texts to be checked.

2. The text duplicate checking method according to claim 1, characterized in that dividing each text to be checked into words according to the word database and determining the word frequency of each word in the text to be checked comprises:
dividing the text to be checked into a number of words according to a preset semantic strategy;
if a word exists in the word database, determining that the word is a first-category word;
if a word does not exist in the word database, determining that the word is a second-category word;
determining the word frequency of each first-category word in the text to be checked.

3. The text duplicate checking method according to claim 1, characterized in that determining the word frequency of each word in the text to be checked comprises:
determining the total number of words in the text to be checked and the number of repetitions of each word;
determining the word frequency of each word in the text to be checked according to the total number of words and the number of repetitions of each word.

4. The text duplicate checking method according to claim 1, characterized in that determining, according to the words and the word frequencies corresponding to the words, the word vectors and the vector weight of each word vector in each text to be checked comprises:
constructing a text vector space from all the words of the text to be checked;
determining the word vector of each word according to each word of the text to be checked and the word frequency corresponding to each word;
determining the weight of the word vector in the text vector space according to the text vector space and the word vector of each word.

5. The text duplicate checking method according to claim 1, characterized in that determining, according to the vector weights, the repetition rate of any two of the texts to be checked comprises:
determining all the words that are the same in the two texts to be checked;
determining the coincidence rate of each identical word in the two texts to be checked according to the vector weights of the identical word in the two texts;
determining the repetition rate of the two texts to be checked according to the coincidence rates of all the identical words.

6. A text duplicate checking device, characterized by comprising:
an acquisition module, configured to acquire a plurality of texts to be checked;
a word segmentation module, configured to divide each text to be checked into words according to a word database and determine the word frequency of each word in the text to be checked;
a vector module, configured to determine, according to the words and the word frequencies corresponding to the words, word vectors and the vector weight of each word vector in each text to be checked;
a duplicate checking module, configured to determine, according to the vector weights, the repetition rate of any two of the texts to be checked.

7. A text management method, characterized by comprising:
obtaining identity verification information and determining an identity verification result;
processing, according to the identity verification result, the text corresponding to the identity verification information;
checking the text for duplicates according to the text duplicate checking method of any one of claims 1 to 5.

8. A text management system, characterized by comprising:
an account management device, configured to obtain identity verification information and determine an identity verification result;
a file transmission device, configured to process, according to the identity verification result, the text corresponding to the identity verification information;
a text duplicate checking device, configured to check the text for duplicates according to the text duplicate checking method of any one of claims 1 to 5.

9. An electronic device, characterized by comprising:
a processor;
a memory electrically connected to the processor;
at least one program stored in the memory and configured to be executed by the processor, the at least one program being configured to implement the text duplicate checking method of any one of claims 1 to 5.

10. A computer-readable storage medium, characterized in that the computer storage medium is used to store computer instructions which, when run on a computer, implement the text duplicate checking method of any one of claims 1 to 5.
CN202010297125.0A 2020-04-15 2020-04-15 Text duplicate checking method and device, text management system and electronic equipment Pending CN111539196A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010297125.0A CN111539196A (en) 2020-04-15 2020-04-15 Text duplicate checking method and device, text management system and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010297125.0A CN111539196A (en) 2020-04-15 2020-04-15 Text duplicate checking method and device, text management system and electronic equipment

Publications (1)

Publication Number Publication Date
CN111539196A true CN111539196A (en) 2020-08-14

Family

ID=71978613

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010297125.0A Pending CN111539196A (en) 2020-04-15 2020-04-15 Text duplicate checking method and device, text management system and electronic equipment

Country Status (1)

Country Link
CN (1) CN111539196A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117251445A (en) * 2023-10-11 2023-12-19 杭州今元标矩科技有限公司 Deep learning-based CRM data screening method, system and medium
CN118260385A (en) * 2024-04-12 2024-06-28 广东万方数据信息科技有限公司 Thesis duplicate checking system and method based on text feature extraction technology
CN119830895A (en) * 2024-12-24 2025-04-15 中国—东盟信息港股份有限公司 Massive text duplicate checking method, system, equipment and storage medium in real scene

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102411583A (en) * 2010-09-20 2012-04-11 阿里巴巴集团控股有限公司 Text matching method and device
CN108536677A (en) * 2018-04-09 2018-09-14 北京信息科技大学 A kind of patent text similarity calculating method
CN108628825A (en) * 2018-04-10 2018-10-09 平安科技(深圳)有限公司 Text message Similarity Match Method, device, computer equipment and storage medium
CN109885813A (en) * 2019-02-18 2019-06-14 武汉瓯越网视有限公司 A kind of operation method, system, server and the storage medium of the text similarity based on word coverage
CN110390084A (en) * 2019-06-19 2019-10-29 平安国际智慧城市科技股份有限公司 Text duplicate checking method, apparatus, equipment and storage medium
CN110858217A (en) * 2018-08-23 2020-03-03 北大方正集团有限公司 Method and device for detecting microblog sensitive topics and readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Wang Chong et al.: "Basic Principles of Modern Information Retrieval Technology: A Tutorial" (现代信息检索技术基本原理教程), vol. 2013, Xidian University Press, pages: 103 - 104 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117251445A (en) * 2023-10-11 2023-12-19 杭州今元标矩科技有限公司 Deep learning-based CRM data screening method, system and medium
CN117251445B (en) * 2023-10-11 2024-06-04 杭州今元标矩科技有限公司 Deep learning-based CRM data screening method, system and medium
CN118260385A (en) * 2024-04-12 2024-06-28 广东万方数据信息科技有限公司 Thesis duplicate checking system and method based on text feature extraction technology
CN119830895A (en) * 2024-12-24 2025-04-15 中国—东盟信息港股份有限公司 Massive text duplicate checking method, system, equipment and storage medium in real scene

Similar Documents

Publication Publication Date Title
CN111797214A (en) Question screening method, device, computer equipment and medium based on FAQ database
CN117633518B (en) Industrial chain construction method and system
CN110321285A (en) Test case processing method and relevant device
Liu et al. Has this bug been reported?
CN113704236A (en) Government affair system data quality evaluation method, device, terminal and storage medium
CN111539196A (en) Text duplicate checking method and device, text management system and electronic equipment
CN115344674B (en) Question answering method and device and electronic equipment
CN112149387A (en) Visualization method and device for financial data, computer equipment and storage medium
US12530722B2 (en) Asset value evaluation method and apparatus, model training method and apparatus, and readable storage medium
CN111427544A (en) Software requirement document generation method and device, storage medium and electronic equipment
CN115617978A (en) Index name retrieval method, device, electronic equipment and storage medium
CN110069594B (en) Contract confirmation method, contract confirmation device, electronic equipment and storage medium
CN112948545A (en) Duplicate checking method, terminal equipment and computer readable storage medium
CN113626655A (en) Method, computer equipment and storage device for extracting information in file
CN117421333A (en) Enterprise document library construction and retrieval method and system
CN111666207A (en) Crowdsourcing test task selection method and electronic device
CN116402166A (en) Training method, device, electronic equipment and storage medium for a prediction model
CN120723807A (en) Text2SQL medical data processing method and electronic device based on large language model
CN111597791A (en) Method and device for extracting comment phrases
CN119621561A (en) An automated testing method for RPC APIs in large industrial systems
CN112100216A (en) Method and device for processing creative keywords
CN119378672A (en) Method, device, equipment and medium for constructing question-answer knowledge database for professional fields
CN117112727A (en) Large language model fine tuning instruction set construction method suitable for cloud computing service
CN116595161A (en) Method, device, equipment and storage medium for recommending items in government digitalization scenarios
CN114357966A (en) Target object scoring method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200814)