CN119808752A

CN119808752A - Document comparison and tracing method, device and computer storage medium

Info

Publication number: CN119808752A
Application number: CN202411715268.3A
Authority: CN
Inventors: 肖思源; 李泽远; 廖静; 龙喜洋
Original assignee: China Merchants Finance Technology Co Ltd
Current assignee: China Merchants Finance Technology Co Ltd
Priority date: 2024-11-27
Filing date: 2024-11-27
Publication date: 2025-04-11

Abstract

Responding to a tracing request for a difference fragment, determining the range of the difference fragment in an original document according to the difference fragment and the difference position corresponding to the difference fragment, and identifying the text in the range as a target text; dividing a target text into independent text fragments according to separators, comparing the text fragments with the difference fragments to determine the similarity between each text fragment and the difference fragment, taking the text fragment with the similarity reaching or exceeding a preset threshold as a candidate fragment, determining the text fragment with the highest similarity from the candidate fragments as a tracing fragment, and returning the position coordinates of the tracing fragment in an original document. According to the application, the tracing analysis range is effectively reduced and the accuracy of document comparison tracing is improved by receiving the tracing request of the difference fragments and the difference positions thereof and positioning the difference range.

Description

Document comparison tracing method, device and computer storage medium

Technical Field

The application relates to the technical field of artificial intelligence, in particular to a document comparison and tracing method, device and computer storage medium.

Background

In the fields of document processing and natural language processing, large models perform well in understanding and generating text, but lack an effective traceability mechanism when performing document comparison tasks.

Specifically, current technical frameworks, such as methods based on the RAG (Retrieval-Augmented Generation) framework, combine information retrieval with generation models. In the information retrieval stage, such a method needs to segment the reference document and store the segments in a vector library, and then uses vector retrieval technology to match the similarity between the query and the document fragments in order to locate the most relevant document fragment.

However, in a document comparison scenario, it is often necessary to compare and identify the overall differences between two documents, rather than to find segments that match a single query. The RAG framework is designed primarily to enhance the answering ability of the generative model by retrieving relevant document snippets, so it focuses on the retrieval and utilization of local information rather than on global document comparison and difference analysis. Thus, when the RAG framework is applied to the overall comparison between documents, it may output a difference fragment that does not match the original fragment. Therefore, when performing document comparison tasks, the RAG framework has difficulty ensuring the accuracy of difference tracing.

Disclosure of Invention

The main purpose of the application is to provide a document comparison tracing method, device and computer storage medium, aiming to solve the technical problem of how to improve the accuracy of document comparison tracing.

In order to achieve the above object, an embodiment of the present application provides a document comparative tracing method, including:

Responding to a tracing request for a difference fragment, determining the range of the difference fragment in an original document according to the difference fragment and the difference position corresponding to the difference fragment, and identifying the text in the range as a target text;

Splitting the target text into independent text fragments according to separators, and comparing the text fragments with the difference fragments to determine the similarity between each text fragment and each difference fragment;

And taking the text segment with the similarity reaching or exceeding a preset threshold value as a candidate segment, determining the text segment with the highest similarity from the candidate segments as a tracing segment, and returning the position coordinates of the tracing segment in the original document.

In an embodiment, the step of responding to the tracing request for the difference segment, determining a range of the difference segment in the original document according to the difference segment and a difference position corresponding to the difference segment, and identifying a text in the range as a target text includes:

receiving a comparison instruction, wherein the comparison instruction comprises comparison parameters and an original document, and identifying a difference point of the original document;

extracting continuous texts containing the difference points as difference fragments, and determining the difference position of each difference fragment in the corresponding original document according to the positioning identification of the original document;

and combining the difference fragments with the difference positions corresponding to the difference fragments to generate a tracing request.

In an embodiment, the step of responding to the tracing request for the difference segment, determining a range of the difference segment in the original document according to the difference segment and the difference position corresponding to the difference segment, and identifying the text in the range as the target text further includes:

positioning corresponding pages or paragraphs in the original document according to the difference positions;

and extracting all text contents containing the difference positions or the difference fragments from the original document as target text.

In an embodiment, after the step of determining a range of the difference fragment in the original document according to the difference fragment and the difference position corresponding to the difference fragment and identifying the text in the range as the target text in response to the tracing request for the difference fragment, the method includes:

creating a PDF analysis instance, and transmitting the difference fragment and the target text to the PDF analysis instance as input parameters;

searching and matching the target text by utilizing the PDF analysis example, and determining the position information and the original text segment corresponding to the difference segment;

And matching each difference segment with the corresponding position information and the corresponding original text segment to generate a tracing result.

In an embodiment, the step of determining the location information and the original text segment corresponding to the difference segment includes:

identifying the layout structure of the target text through the PDF analysis example, and marking all text units in the target text;

matching the difference segment with each text unit, and determining a matching result of the difference segment and each text unit;

And based on the matching result, taking the text unit successfully matched as an original text segment of the difference segment, and determining the position information of the original text segment in the target text.

In one embodiment, the step of splitting the target text into separate text segments according to separators, and comparing the text segments with the difference segments, and determining the similarity between each text segment and the difference segment includes:

identifying a separator in the target text, and splitting the target text into independent text fragments by using the separator;

converting each text segment and each difference segment into a character sequence, and comparing the character sequence of the text segment and the character sequence of the difference segment character by character;

Counting the number of character matching and the number of character mismatching of the text segment and the character sequence of the difference segment at the same position;

And calculating the similarity between each text segment and the difference segment based on the number of character matches and the number of character mismatches.
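The character-by-character comparison in this embodiment can be illustrated with a minimal Python sketch. The function name `positional_similarity` is hypothetical, and two details are assumptions not stated in the text: the similarity is taken as matches divided by matches plus mismatches, and when the two sequences differ in length the unpaired trailing characters are counted as mismatches.

```python
from itertools import zip_longest


def positional_similarity(segment: str, difference: str) -> float:
    """Compare two character sequences position by position.

    Assumption: similarity = matches / (matches + mismatches), and
    unpaired characters of the longer string count as mismatches.
    """
    matches = 0
    mismatches = 0
    # zip_longest pads the shorter sequence with None, which never equals a character.
    for seg_char, diff_char in zip_longest(segment, difference):
        if seg_char == diff_char:
            matches += 1
        else:
            mismatches += 1
    total = matches + mismatches
    # Two empty strings are treated as identical.
    return matches / total if total else 1.0


score = positional_similarity("abcd", "abcf")  # three of four positions match
```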

In an embodiment, the step of using the text segment with the similarity reaching or exceeding a preset threshold as a candidate segment, determining the text segment with the highest similarity from the candidate segments as a tracing segment, and returning the position coordinates of the tracing segment in the original document includes:

Screening out the text fragments with the similarity reaching or exceeding a first preset threshold value from the target text, and taking the text fragments as first candidate fragments;

And based on the first candidate segment, taking the text segment with the similarity reaching or exceeding a second preset threshold value as a second candidate segment, determining the text segment with the highest similarity in the second candidate segment as the tracing segment, and returning the position coordinates of the tracing segment in the original document.

In an embodiment, the step of using the text segment with the similarity reaching or exceeding a second preset threshold as a second candidate segment based on the first candidate segment, determining the text segment with the highest similarity in the second candidate segment as the tracing segment, and returning the position coordinates of the tracing segment in the original document further includes:

If the text segment with the similarity reaching or exceeding a second preset threshold value does not exist, vector representations of the target text and the difference segment are respectively obtained;

calculating semantic similarity between each text segment in the target text and the difference segment;

and determining a text segment with the highest semantic similarity with the difference segment as the tracing segment in the target text, and returning the position coordinates of the tracing segment in the original document.
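The two-threshold screening and the semantic fallback described in the two embodiments above might be combined as in the following sketch. The function name `trace_with_fallback`, the dictionary layout of `scores`, and the `semantic_similarity` callable are all hypothetical; how the vector representations are obtained is left to the caller, since the text does not specify it.

```python
from typing import Callable, Optional


def trace_with_fallback(
    scores: dict[str, float],
    first_threshold: float,
    second_threshold: float,
    semantic_similarity: Callable[[str], float],
) -> Optional[str]:
    """Pick the tracing segment from {segment text: string similarity}.

    `semantic_similarity` is a caller-supplied (hypothetical) function that
    returns the semantic similarity between a segment and the difference segment.
    """
    # First screening: segments reaching the first preset threshold.
    first = {seg: s for seg, s in scores.items() if s >= first_threshold}
    # Second screening: of those, segments reaching the second preset threshold.
    second = {seg: s for seg, s in first.items() if s >= second_threshold}
    if second:
        return max(second, key=second.get)
    if not scores:
        return None
    # Fallback: no segment passed, so rank every segment by semantic similarity.
    return max(scores, key=semantic_similarity)


scores = {"seg a": 0.95, "seg b": 0.85, "seg c": 0.40}
direct = trace_with_fallback(scores, 0.8, 0.9, lambda seg: 0.0)
fallback = trace_with_fallback(scores, 0.8, 0.99, lambda seg: 1.0 if seg == "seg c" else 0.0)
```

In this sketch only the segment text is returned; looking up the position coordinates of the chosen segment in the original document would be a separate step.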

The embodiment of the application also provides a document comparison tracing device, which comprises a memory, a processor and a computer program stored on the memory and capable of running on the processor, wherein the computer program is configured to realize the steps of the document comparison tracing method.

The embodiment of the application also provides a computer storage medium, which is a computer readable storage medium, and a computer program is stored on the computer storage medium, and when the computer program is executed by a processor, the steps of the document comparison tracing method are realized.

The embodiment of the application discloses a document comparison tracing method, which comprises the steps of responding to tracing requests for difference fragments, determining the range of the difference fragments in an original document according to the difference fragments and the difference positions corresponding to the difference fragments, identifying texts in the range as target texts, splitting the target texts into independent text fragments according to separators, comparing the text fragments with the difference fragments, determining the similarity between each text fragment and each difference fragment, taking the text fragment with the similarity reaching or exceeding a preset threshold as a candidate fragment, determining the text fragment with the highest similarity from the candidate fragments as a tracing fragment, and returning the position coordinates of the tracing fragment in the original document. According to the method and the device, the accurate positioning of the target text is realized by receiving the tracing request containing the difference fragments and the positions thereof, so that the tracing analysis range is narrowed. On the basis, the similarity between each text segment and the difference segment is determined by comparing and analyzing each text segment and the difference segment in the target text, and the text segment with the highest similarity is obtained and is used as the tracing segment. Therefore, the method and the device accurately return the original text fragments and the specific coordinates thereof of the difference fragments in the original document based on the tracing fragments, and remarkably improve the accuracy of document comparison tracing.

Drawings

FIG. 1 is a schematic flow chart of a first embodiment of a document comparative tracing method according to an embodiment of the present application;

FIG. 2 is a schematic flow chart of a second embodiment of a document comparison tracing method according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of a third embodiment of a document comparison tracing method according to an embodiment of the present application;

FIG. 4 is a schematic flow chart of a fourth embodiment of a document comparative tracing method according to an embodiment of the present application;

FIG. 5 is a schematic structural diagram of the document comparison tracing device of the present application.

The achievement of the objects, functional features and advantages of the present application will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.

In order to solve the above-mentioned drawbacks in the related art, an embodiment of the present application provides a document comparison tracing method, which includes determining a range of a difference fragment in an original document according to the difference fragment and a difference position corresponding to the difference fragment in response to a tracing request for the difference fragment, identifying a text within the range as a target text, splitting the target text into independent text fragments according to separators, comparing the text fragments with the difference fragment, determining a similarity between each text fragment and the difference fragment, using the text fragment with the similarity reaching or exceeding a preset threshold as a candidate fragment, determining the text fragment with the highest similarity from the candidate fragments as a tracing fragment, and returning a position coordinate of the tracing fragment in the original document. According to the method and the device, the accurate positioning of the target text is realized by receiving the tracing request containing the difference fragments and the positions thereof, so that the tracing analysis range is narrowed. On the basis, the similarity between each text segment and the difference segment is determined by comparing and analyzing each text segment and the difference segment in the target text, and the text segment with the highest similarity is obtained and is used as the tracing segment. Therefore, the method and the device accurately return the original text fragments and the specific coordinates thereof of the difference fragments in the original document based on the tracing fragments, and remarkably improve the accuracy of document comparison tracing.

It should be noted that, the execution body of the embodiment may be a question-answering system, or may be a computing service device with functions of data processing, network communication and program running, such as a tablet computer, a personal computer, a mobile phone, or a document comparison traceability device capable of implementing the above functions. Hereinafter, this embodiment and the following embodiments will be described with reference to a question-answering system (hereinafter, simply referred to as "system").

Referring to fig. 1, the document comparison tracing method of the first embodiment of the present application includes steps S10 to S30:

Step S10: responding to a tracing request for the difference fragment, determining the range of the difference fragment in the original document according to the difference fragment and the difference position corresponding to the difference fragment, and identifying the text in the range as a target text.

In this embodiment, the tracing request refers to an instruction, initiated by a user or by the system, that aims to determine the source of a difference point in the original document by means of the tracing method. The difference fragments refer to the inconsistent parts of the two original documents to be compared, and are the basis for the tracing analysis. The original document may be in PDF format, Word format, text format, image format, or the like. The difference position indicates the specific position of a difference fragment in the original document, so that the scope of the tracing analysis is reduced.

Before the tracing request is received, a comparison instruction is received, where the comparison instruction includes comparison parameters and the original documents, and the difference points of the original documents are identified. Continuous text containing the difference points is extracted as difference fragments, and the difference position of each difference fragment in its corresponding original document is determined according to the positioning identification of the original document. The difference fragments and their corresponding difference positions are then combined to generate the tracing request.

Optionally, a general prompt engineering template is constructed, and a comparison instruction is input into the large model according to the prompt engineering template. The comparison instruction includes, but is not limited to, two original documents to be compared and comparison parameters, such as a comparison range, a difference point original text segment, a difference point position, an output format, a word number requirement and the like. After receiving the comparison instruction, the large model analyzes the content of the comparison instruction and acquires the comparison requirement of the original document to be compared. And then identifying the difference points of the original document, extracting continuous texts containing the difference points from the original document as difference fragments, acquiring difference position information of the corresponding difference fragments, and generating a comparison answer containing the difference fragments and the corresponding difference positions of the difference fragments. And finally, generating a tracing request according to the comparison answer. The comparison answer contains basic information required for performing the traceability analysis, such as the text content of the difference fragment, the difference position of the difference fragment in the original document, etc.

Note that a difference fragment refers to a portion of text identified in two original documents where there is a difference point. These difference fragments may differ from the original fragments in the original because the difference fragments were extracted to highlight the differences. The difference positions point to specific positions of the difference fragments in the original document, including page numbers, chapter marks and the like, so that when the traceability analysis is performed, the approximate range of each difference fragment can be positioned in the original document.

Illustratively, assume that a prompt engineering template exists with the following content: "The text of document 1 is as follows: abc. The text of document 2 is as follows: abc123. Please indicate the differences between document 1 and document 2. For each difference point, please provide the original text segment and the corresponding position of the difference point in document 1 and document 2, respectively." A comparison instruction is constructed according to the prompt engineering template and input into the large model. The comparison answer output by the large model is that the difference fragment of document 1 is "I am a college student" with a difference position of page 2, and the difference fragment of document 2 is "I am a graduate student" with a difference position of page 2. Based on this comparison answer, the system generates a tracing request, executes the tracing analysis according to the difference fragments and their position information, and determines the accurate position of each difference fragment in the original documents to be compared, including the page number, paragraph identifier and line number.

Alternatively, the original document is submitted directly through a user interface or text content of the original document is pasted, and text corresponding to the difference segments and difference positions in the original document is input. Based on user input, the system automatically generates a traceability request, and then the traceability request is processed by the traceability analysis component. The traceability request contains the original document submitted by the user, the text content of the difference fragment and the position information of the difference fragment in the original document.
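As an illustration of how difference fragments and their difference positions might be combined into a tracing request, the following Python sketch uses a hypothetical schema. The class names, the field names, and the reduction of the difference position to a single page number are assumptions made for the example, not details given in the application.

```python
from dataclasses import dataclass, field


@dataclass
class DifferenceFragment:
    """One difference fragment plus its reported location (hypothetical schema)."""
    document_id: str  # which original document the fragment came from
    text: str         # text content of the difference fragment
    page: int         # difference position, reduced here to a page number


@dataclass
class TracingRequest:
    """Bundle of difference fragments to be traced (hypothetical schema)."""
    fragments: list[DifferenceFragment] = field(default_factory=list)


def build_tracing_request(comparison_answer: list[dict]) -> TracingRequest:
    """Combine each difference fragment with its difference position."""
    fragments = [
        DifferenceFragment(item["document_id"], item["text"], item["page"])
        for item in comparison_answer
    ]
    return TracingRequest(fragments)


# A comparison answer in an assumed dictionary form.
answer = [
    {"document_id": "document 1", "text": "I am a college student", "page": 2},
    {"document_id": "document 2", "text": "I am a graduate student", "page": 2},
]
request = build_tracing_request(answer)
```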

In this embodiment, the system locates in the original document the range, such as a page or a paragraph, that contains the difference position or the difference fragment, and extracts all text content within that range from the original document as the target text. This narrows the range of the tracing analysis, so that the analysis can focus on the page text content related to the difference fragment.

Optionally, the document content is read page by page by a PDF (Portable Document Format) parsing tool, and the specific page range containing the difference segments is located according to the page number information that is read. The document object model may also be accessed through a COM (Component Object Model) automation interface or the python-docx library (a document processing library for the Python programming language), so as to locate the particular chapter or paragraph containing a difference segment based on the difference position, such as a paragraph number or bookmark. All text is extracted from the located section or paragraph and used as the target text, ensuring that the target text contains the complete context of the difference segment.

Optionally, the original document may be preprocessed using optical character recognition (Optical Character Recognition, OCR) techniques. After preprocessing is completed, the system enables an OCR engine to perform character recognition on the original document. In the recognition process, the OCR engine analyzes the text area according to the layout of the original document, recognizes the text with different fonts, sizes and formats, and converts the text into text information. When extracting the target text, the system locates a text region or text range containing the difference fragment according to the difference position, such as a scanned page number. To maintain the contextual integrity of the text, the system may also extract text content before and after the difference segments, including text content of adjacent paragraphs or chapters, as desired. Finally, the system takes the extracted text content as a target text.

Optionally, the system first performs structural analysis on the original document, and identifies the hierarchical structure of the original document, including a main directory, a subdirectory, a chapter title, a paragraph identifier, and the like. Based on the structured information and the discrepancy locations, the system can quickly determine the chapter range in which the discrepancy piece is likely to be located. And then extracting the whole chapter or all texts corresponding to the page number or paragraph texts according to the difference positions, such as the chapter number, the page number or the paragraph mark and the starting position and the ending position of the chapter, the page number or the paragraph mark. If a difference fragment is located in a particular paragraph, the system, when locating the paragraph, will extract a range of text content from before the paragraph start position to after the difference fragment end position to ensure the contextual integrity of the traceability analysis. In addition, if the difference segment is located near the page boundary, the system will also extract the text content of the adjacent page to ensure that the context information of the difference segment is fully preserved. Finally, the system takes the extracted text content as a target text.
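The page-based extraction of the target text, including the optional inclusion of adjacent pages, can be sketched as follows. The sketch assumes that the page texts have already been extracted into a Python list by some parsing or OCR step; the function name `extract_target_text` and the `context_pages` parameter are hypothetical.

```python
def extract_target_text(pages: list[str], page_number: int, context_pages: int = 0) -> str:
    """Return the text of the located page plus optional neighbouring pages.

    `pages` holds already-extracted page text; `page_number` is 1-based.
    """
    if not 1 <= page_number <= len(pages):
        raise ValueError(f"page {page_number} is outside the document")
    # Clamp the window so it never runs past the first or last page.
    first = max(1, page_number - context_pages)
    last = min(len(pages), page_number + context_pages)
    return "\n".join(pages[first - 1:last])


pages = ["Cover page.", "I am a college student.\nI study finance.", "Appendix."]
target_text = extract_target_text(pages, page_number=2)
with_context = extract_target_text(pages, page_number=2, context_pages=1)
```

Setting `context_pages` above zero corresponds to the case where the difference segment lies near a page boundary and the text of adjacent pages is also extracted.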

Step S20: splitting the target text into independent text fragments according to separators, and comparing the text fragments with the difference fragments to determine the similarity between each text fragment and each difference fragment.

In this embodiment, since the difference fragment may be different from the original text fragment in the original document, it is necessary to compare the text fragment of the extracted target text with the difference fragment and calculate the similarity between each text fragment and the difference fragment.

Specifically, the system first determines the separators used to split the target text, including but not limited to line breaks, punctuation marks (e.g., commas, periods), or other logical separators used to organize text in the document. In this way, the target text is divided into a plurality of separate text segments, each of which may contain a portion of the text of the difference segment. For example, the system uses line breaks to identify the different paragraphs in the target text and treats each paragraph as a separate text segment.
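A minimal sketch of separator-based splitting is given below, using the regular expression module of the Python standard library. The separator set in `SEPARATORS` (line breaks plus common sentence-ending punctuation) and the function name are assumptions chosen for illustration; a real deployment would choose separators to suit the document.

```python
import re

# Hypothetical separator set: line breaks plus common sentence-ending punctuation.
SEPARATORS = r"[\n。！？.!?]+"


def split_into_segments(target_text: str) -> list[str]:
    """Split the target text on separators and drop empty pieces."""
    pieces = re.split(SEPARATORS, target_text)
    return [piece.strip() for piece in pieces if piece.strip()]


segments = split_into_segments(
    "I am a college student.\nI study finance. I live in Shenzhen."
)
```

Note that treating the period as a separator would also split decimal numbers such as "3.14", so the separator set may need to be narrowed for numeric text.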

The system then uses a text comparison algorithm to compare the split text fragments with the difference fragments one by one, calculating the similarity between each text segment and the difference segment in the process. The computation of similarity may be based on a variety of text analysis means, including but not limited to string matching algorithms (e.g., edit distance functions), semantic analysis algorithms (e.g., bag-of-words models), and sequence alignment algorithms. These algorithms capture the similarity between texts from different angles: for example, cosine similarity evaluates similarity by calculating the cosine of the angle between vectors, while edit distance measures similarity by calculating the minimum number of edit operations required to convert one string into another.

Alternatively, the system may employ an edit distance function to calculate the similarity between the text segment and the difference segment. The edit distance, also called the Levenshtein distance, is a calculation method used to measure the difference between two sequences. It determines the similarity of two strings by calculating the minimum number of single character editing operations required to convert one string to another. Editing operations include inserting a character, deleting a character, or replacing a character.

Specifically, after splitting the target text into a plurality of separate text segments, the system further needs to convert these text segments into a form suitable for string comparison. The system first extracts each text segment in the target text and converts it to plain text format, removing any format tags or non-text elements, such as HTML tags and image descriptions. The system then cleans the extracted text segment, including removing whitespace characters, unifying letter case, and removing special characters or symbols, to ensure the consistency and comparability of the character strings.
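The cleaning step might look like the following sketch. The specific rules (a simple regular expression for HTML-like tags, lowercasing as the way of unifying case, and removal of every character that is not a letter, digit or underscore) are assumptions for illustration, and `clean_segment` is a hypothetical name.

```python
import re


def clean_segment(text: str) -> str:
    """Normalize a text segment into a plain, comparable string."""
    text = re.sub(r"<[^>]+>", "", text)  # drop HTML-like tags
    text = text.lower()                  # unify letter case
    text = re.sub(r"\s+", "", text)      # remove whitespace characters
    return re.sub(r"[^\w]", "", text)    # remove punctuation and other symbols


cleaned = clean_segment("<b>I am</b> a College  Student!")
```

Applying the same function to both the text segments and the difference segment keeps the two sides comparable.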

After the preprocessing operation is completed, each text segment is converted into a clean character string, which contains the entire content of the text segment and does not contain any elements that may interfere with the comparison. Then, the system uses these character strings as inputs, and applies an edit distance algorithm to perform similarity calculation. The edit distance function is calculated as follows:

First, initialize a matrix of size $(m+1) \times (n+1)$, where $m$ and $n$ are the lengths of the character strings of the difference segment and the text segment, respectively. Each cell of the matrix represents the minimum edit distance from a prefix of the difference fragment to a prefix of the text fragment. The calculation process starts from the first row and first column of the matrix and gradually fills the entire matrix.

Initialize the first row and the first column of the matrix by filling them with the values from 0 up to the corresponding string length. These values represent the edit distance from the empty string to each prefix of the text segment, and from each prefix of the difference segment to the empty string, respectively.

Next, the remaining cells of the matrix, i.e., each character of the two strings, are traversed. For each cell in the matrix, a value is calculated according to the following rule:

If the current character is the same, the value of the current cell is equal to the value of the upper left cell because no editing operation is required.

If the current character is different, the value of the current cell is the minimum of three editing operations, namely the minimum of the upper left cell (representing the replacement operation), the left cell (representing the insert operation), and the upper cell (representing the delete operation), plus one.

In this way, each cell in the matrix eventually contains the edit distance between the two prefixes that end at that cell. After the entire matrix is filled, the cell in the lower right corner contains the edit distance between the two complete strings. To convert the edit distance into a similarity, the system may normalize it, for example by subtracting the edit distance from the sum of the two string lengths and then dividing by the sum of the two string lengths, to obtain a similarity score between 0 and 1, where 0 represents completely different and 1 represents exactly the same.

Illustratively, assuming the strings "kitten" and "sitting" are compared, a matrix of size 7×8 (that is, (6+1)×(7+1), since the strings have 6 and 7 characters) is created and then populated according to the rules described above. Finally, the cell in the bottom right-hand corner of the matrix represents the minimum number of editing operations required to convert "kitten" to "sitting", i.e., the edit distance is 3 (replace "k" with "s", replace "e" with "i", and insert "g" at the end). The similarity score is then $\frac{(6+7)-3}{6+7} = \frac{10}{13} \approx 0.769$, which shows that the strings "kitten" and "sitting" have about 76.9% similarity.
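The matrix-filling rules and the normalization described above can be sketched in Python as follows. The function names `edit_distance` and `edit_similarity` are hypothetical, and treating two empty strings as fully similar is an assumption added to avoid division by zero.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance computed with an (m+1) x (n+1) matrix."""
    m, n = len(source), len(target)
    # dist[i][j] is the edit distance between source[:i] and target[:j].
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i  # delete all i characters of the source prefix
    for j in range(n + 1):
        dist[0][j] = j  # insert all j characters of the target prefix
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if source[i - 1] == target[j - 1]:
                dist[i][j] = dist[i - 1][j - 1]  # same character, no edit
            else:
                dist[i][j] = 1 + min(
                    dist[i - 1][j - 1],  # upper left cell: replace
                    dist[i][j - 1],      # left cell: insert
                    dist[i - 1][j],      # upper cell: delete
                )
    return dist[m][n]


def edit_similarity(source: str, target: str) -> float:
    """Normalize as (len sum - distance) / len sum, as described in the text."""
    total = len(source) + len(target)
    if total == 0:
        return 1.0  # assumption: two empty strings count as identical
    return (total - edit_distance(source, target)) / total


distance = edit_distance("kitten", "sitting")
similarity = edit_similarity("kitten", "sitting")
```

One property of this normalization is worth keeping in mind when choosing a threshold: because the edit distance never exceeds the length of the longer string, two equal-length strings with no characters in common, such as "abc" and "xyz", score 0.5 rather than 0.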

Alternatively, the system may employ a cosine similarity algorithm to calculate the similarity between the text segment and the difference segment. Cosine similarity is a similarity measure that measures the angle between two non-zero vectors, and is commonly used in text processing to evaluate the similarity of two vectors by calculating their angle cosine values. In text analysis, cosine similarity can be calculated by converting text into vectors in vector space.

After splitting the target text into a plurality of separate text fragments and converting into strings, the system first needs to convert these strings into vectors. This step typically involves vectorization of text, where each string (the string converted by a text segment or a difference segment) is represented as a vector in a high-dimensional space. Vectorization can be achieved by various methods, such as the Bag of Words model (Bag of Words) or TF-IDF (Term Frequency-Inverse Document Frequency, a statistical method), etc.

Once the difference segment and each text segment are converted into vectors, the system can calculate the cosine similarity between them. The calculation formula of cosine similarity is as follows:

$$\cos(\theta) = \frac{A \cdot B}{\|A\| \, \|B\|}$$

Wherein $A \cdot B$ represents the dot product of vector $A$ and vector $B$, and $\|A\|$ and $\|B\|$ respectively represent the moduli (lengths) of vector $A$ and vector $B$. The dot product calculates the sum of the projections of the two vectors in the same direction, while the moduli normalize the lengths of the vectors, ensuring that the similarity is not affected by the vector size.

By calculating the cosine similarity between the text segment vector and the difference segment vector, the system can obtain a value between -1 and 1, where 1 represents the exact same direction (i.e., exact similarity), -1 represents the exact opposite direction (i.e., exact dissimilarity), and 0 represents that the vectors are orthogonal with no similarity.

To convert cosine similarity to a more intuitive similarity percentage, the system may normalize the computed cosine value, for example, by adding 1 to the cosine value and dividing the result by 2, to obtain a value between 0 and 1, where 0 represents completely different and 1 represents completely identical.

For example, if the cosine similarity calculation result of the text segment "kitten" and the difference segment "sitting" is 0.4, the normalized similarity score is $(0.4 + 1) / 2 = 0.7$. This indicates that the similarity between "kitten" and "sitting" is 70%.
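The vectorization, cosine calculation, and normalization described above can be sketched as follows. This is an illustrative sketch only: it assumes a whitespace-tokenized Bag of Words representation, and the function names are not taken from the application.

```python
import math
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Term-count vector; assumes simple lowercase whitespace tokenization."""
    return Counter(text.lower().split())


def cosine_similarity(u: Counter, v: Counter) -> float:
    """Dot product of the two vectors divided by the product of their moduli."""
    dot = sum(u[term] * v[term] for term in u.keys() & v.keys())
    norm_u = math.sqrt(sum(count * count for count in u.values()))
    norm_v = math.sqrt(sum(count * count for count in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # assumption: an empty vector is treated as having no similarity
    return dot / (norm_u * norm_v)


def normalize_cosine(cos_value: float) -> float:
    """Map a cosine value from [-1, 1] to [0, 1] via (cos + 1) / 2."""
    return (cos_value + 1) / 2


if __name__ == "__main__":
    segment = bag_of_words("the contract term is five years")
    difference = bag_of_words("the contract term is three years")
    cos_value = cosine_similarity(segment, difference)
    print(cos_value, normalize_cosine(cos_value))
```

Note that term-count vectors have no negative components, so their cosine value is never negative and the normalized score of such vectors falls between 0.5 and 1. The full range from 0 to 1 is only reached with representations that allow negative components.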

In this way, the system can provide a quantitative assessment of the similarity between each text segment and the difference segment, thereby providing valuable information for traceability analysis.

Step S30, taking the text segment with the similarity reaching or exceeding a preset threshold value as a candidate segment, determining the text segment with the highest similarity from the candidate segments as a tracing segment, and returning the position coordinates of the tracing segment in the original document.

In this embodiment, the system sets a similarity threshold in advance. The preset threshold is used to determine whether the similarity between a text segment and the difference segment is sufficiently high, so that the text segments most likely to contain the source of the difference can be screened out as potential trace-source segments. The preset threshold value can be adjusted according to specific application scenes and requirements so as to ensure the accuracy and flexibility of the screening process.

The system traverses the similarity scores of all the text fragments and screens out the text fragments with the similarity scores reaching or exceeding the preset threshold value to form a candidate fragment list. The candidate segments in the candidate segment list are considered to be text segments that are highly correlated to the difference segment.

In the candidate segment list, the system further compares the similarity scores between each candidate segment and the difference segment, and finds the candidate segment with the highest similarity score. The text segment corresponding to this candidate segment is determined to be the trace-source segment, since it most likely contains the source information of the difference segment.

The system records the position coordinates of the trace-source segment in the original document, which may be the start position and the end position of the character level, or may be the line number and the column number, depending on the design of the structure of the original document. The purpose of recording the position coordinates is to be able to quickly locate the specific position of the trace-source segment in the original document during the subsequent analysis or display process.
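The screening and selection of step S30 can be sketched as follows. This is a minimal sketch under stated assumptions: the `ScoredSegment` record, its character-offset coordinates, and the function name are hypothetical and chosen only for illustration.

```python
from typing import List, NamedTuple, Optional


class ScoredSegment(NamedTuple):
    text: str
    start: int         # character offset of the segment start in the original document
    end: int           # character offset one past the segment end
    similarity: float  # similarity to the difference segment, in [0, 1]


def select_trace_segment(
    segments: List[ScoredSegment], threshold: float
) -> Optional[ScoredSegment]:
    """Keep segments at or above the threshold, then return the best one."""
    candidates = [s for s in segments if s.similarity >= threshold]
    if not candidates:
        return None  # no segment is similar enough to be a candidate
    return max(candidates, key=lambda s: s.similarity)


if __name__ == "__main__":
    scored = [
        ScoredSegment("The term is five years", 0, 22, 0.91),
        ScoredSegment("Payment is due monthly", 24, 46, 0.35),
        ScoredSegment("The term is renewable", 48, 69, 0.78),
    ]
    best = select_trace_segment(scored, threshold=0.7)
    if best is not None:
        print(best.text, (best.start, best.end))
```

Returning the whole record keeps the position coordinates attached to the selected segment, so no second lookup is needed.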

Alternatively, the system may employ an efficient sorting algorithm, such as quicksort or merge sort, to sort the similarity scores of all text segments. After the similarity score between each text segment and the difference segment is calculated, the similarity scores are used as the basis of sorting.

Through the quicksort algorithm, the system can recursively partition the text fragments into a high-similarity group and a low-similarity group, and finally obtain a list arranged according to similarity scores.

Merge sort is realized by dividing the text fragments into smaller sub-lists and gradually merging the sub-lists, which ensures the stability and efficiency of sorting.

By using the sorting algorithm, the system can quickly locate the text segment with the highest similarity with the difference segment, and provide an accurate starting point for tracing analysis.
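The merge sort variant mentioned above can be sketched as follows. The representation of each item as a (segment, score) pair and the function name are assumptions made for this illustration.

```python
from typing import List, Tuple

Scored = Tuple[str, float]  # (text segment, similarity score)


def merge_sort_by_similarity(items: List[Scored]) -> List[Scored]:
    """Stable merge sort that orders items from highest to lowest similarity."""
    if len(items) <= 1:
        return list(items)
    mid = len(items) // 2
    left = merge_sort_by_similarity(items[:mid])
    right = merge_sort_by_similarity(items[mid:])

    merged: List[Scored] = []
    i = j = 0
    while i < len(left) and j < len(right):
        # ">=" takes from the left half on ties, which keeps the sort stable.
        if left[i][1] >= right[j][1]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


if __name__ == "__main__":
    ranked = merge_sort_by_similarity(
        [("a", 0.2), ("b", 0.9), ("c", 0.9), ("d", 0.5)]
    )
    print(ranked[0])  # the first item is the most similar segment
```

Because the sort is stable, segments with equal scores keep their original document order, so a tie is resolved in favor of the segment that appears first.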

Alternatively, the system may utilize a heap data structure, particularly a maximum heap, to maintain similarity scores for candidate segments. In the process of calculating the similarity score, the system continuously updates the maximum heap, and ensures that the top of the heap is always the fragment with the highest current similarity. Because the heap data structure allows quick access to the largest element, the system can acquire the text fragment with highest similarity in constant time, thereby greatly improving the retrieval efficiency. This approach is particularly suitable for handling a large number of candidate segments, and can ensure that the system finds the most relevant text segment in the shortest time.
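A max heap of this kind can be sketched with the Python standard library module `heapq`. Since `heapq` implements a min heap, this sketch stores negated scores. The class and method names are assumptions made for illustration.

```python
import heapq
from typing import List, Optional, Tuple


class MaxSimilarityHeap:
    """Max heap over similarity scores, built on heapq by negating the scores."""

    def __init__(self) -> None:
        self._heap: List[Tuple[float, int, str]] = []
        self._counter = 0  # insertion order, used to break ties between equal scores

    def push(self, similarity: float, segment: str) -> None:
        # Each insertion costs O(log n).
        heapq.heappush(self._heap, (-similarity, self._counter, segment))
        self._counter += 1

    def peek_best(self) -> Optional[Tuple[str, float]]:
        # Reading the heap top is O(1) and does not remove it.
        if not self._heap:
            return None
        neg_similarity, _, segment = self._heap[0]
        return segment, -neg_similarity


if __name__ == "__main__":
    heap = MaxSimilarityHeap()
    for score, text in [(0.42, "first"), (0.88, "second"), (0.61, "third")]:
        heap.push(score, text)
    print(heap.peek_best())
```

Reading the top of the heap takes constant time, while each update made during score calculation takes logarithmic time in the number of stored segments.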

Alternatively, the system may maintain a mapping table that associates each text segment with its coordinates in the original document. This mapping table may be a hash table or dictionary in which keys are text fragments and values are the location information of the text fragments in the original document. Once the traceability segment is determined, the system can immediately find the specific position of the traceability segment in the original document through the mapping table. The method not only quickens the process of position searching, but also improves the efficiency of integral traceability analysis. By the method, the system can provide accurate original text position information of the text fragments in the original document, so that a user can quickly locate specific positions of the difference fragments.
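Such a mapping table can be sketched with a dictionary. In this sketch each key maps to a list of positions, an assumption made here because the same text may occur more than once in the original document; the function name and the (start, end) character offsets are likewise illustrative.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Position = Tuple[int, int]  # (start offset, end offset) in the original document


def build_position_table(
    segments: Iterable[Tuple[str, int, int]]
) -> Dict[str, List[Position]]:
    """Map each text segment to every position at which it occurs."""
    table: Dict[str, List[Position]] = defaultdict(list)
    for text, start, end in segments:
        table[text].append((start, end))
    return dict(table)


if __name__ == "__main__":
    table = build_position_table(
        [
            ("The term is five years", 0, 22),
            ("Payment is due monthly", 24, 46),
            ("The term is five years", 120, 142),
        ]
    )
    # Average-case O(1) lookup once the trace-source segment is known.
    print(table.get("The term is five years", []))
```

When a segment occurs several times, the difference position supplied with the tracing request could be used to choose among the recorded positions.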

Through the scheme, the text fragments most similar to the difference fragments can be found, the processing speed and the processing efficiency can be improved, and accurate and efficient traceability analysis results can be provided for users.

Referring to fig. 2, after step S20, the document comparison tracing method according to the second embodiment of the present application includes steps S210 to S230:

Step S210, creating a PDF parsing instance, and passing the difference fragment and the target text to the PDF parsing instance as input parameters.

In the present embodiment, the PDF parsing instance is a tool dedicated to processing a PDF-format document, which is capable of reading and interpreting the structure and contents of a PDF file.

The system passes the difference fragment and the target text as input parameters to the PDF parsing instance. The difference segments are the text portions that are identified as not matching between the two original documents to be compared, and the target text is the entire text content within a range that contains these difference segments or difference locations.

The system may employ an open source PDF processing library such as Apache PDFBox (an open source Java library for processing PDF documents), iText (a Java class library for generating PDF documents), or PDFMiner (a Python library for parsing PDF documents).

These open source PDF processing libraries provide rich APIs (Application Programming Interfaces) that allow programs to read the text content, metadata, structural information, etc. of PDF files and to operate at the page level. The system may utilize these APIs to create PDF parsing instances and prepare for text retrieval and matching.

Step S220, searching and matching the target text by using the PDF parsing instance, and determining the position information and the original text segment corresponding to the difference segment.

In this embodiment, the system uses the functions of the created PDF parsing instance, such as full text search or text extraction, to search and match the target text, so as to determine the location information of the difference segment in the target text and the corresponding original text segment. Full text search or text extraction may involve page-by-page analysis of the target text to ensure accurate localization of the difference segment.

The system may employ text localization algorithms, such as text content based searches or layout analysis based methods, to determine the location of the difference segments. For example, the system may determine the coordinates or page numbers of the difference segment in the target text by comparing the difference segment with the text content in the target text, thereby obtaining the original text location information of the difference segment in the original document.

It can be understood that the location information is the location of the difference segment in the target text, and may be a page number, coordinates (e.g., y-position on the page), or other forms of identification of the text on the page, and the original text location information refers to the corresponding location of the difference segment in the original document. If the positioning mark in the target text is consistent with the positioning mark in the original document, for example, the positioning mark is based on the same page number system, the position information is consistent with the original text position information.

Step S230, matching each difference segment with the corresponding position information and the original text segment to generate a tracing result.

In this embodiment, based on the analysis result of step S220, the system associates the difference segment with its corresponding location information in the target text and with the original text segment. The difference segment, the location information, and the corresponding original text segment are then matched to generate a tracing result.

Alternatively, the system generates a tracing result containing the difference segment, the corresponding original text position information, and the corresponding original text segment according to the position of the original text segment in the original document, for example, by mapping the position information in the target text to the original text position information in the original document, thereby providing the user with the original coordinates of the difference segment.

Further, step S220 may include steps S221 to S223:

Step S221, identifying the layout structure of the target text through the PDF parsing instance, and marking all text units in the target text.

In this embodiment, the system identifies the layout structure of the target text through the PDF parsing instance and marks all text units. This step is the basis for understanding the document structure of the original document, and involves using PDF parsing tools to analyze and understand the arrangement and organization of text in the original document. The system needs to identify the different text units in the original document, such as paragraphs, titles, list items, etc., typically by analyzing visual characteristics of the text such as font size, style, spacing, and location. For example, the system may use a deep learning model to identify text line and word boundaries, and their relative locations on the page, to construct a structured document layout representation.

Step S222, matching the difference fragment with each text unit, and determining a matching result of the difference fragment and each text unit.

In this embodiment, the difference segment is matched with each identified text unit, and the degree of matching (i.e., the similarity) between the difference segment and each text unit is determined based on a preset threshold. In this process, the matching result between the difference fragment and each text unit is calculated using a PDF parsing tool.

It should be noted that, when comparing each text unit with the difference fragment by using the PDF parsing tool, the matching result has only two cases: successful matching and failed matching. If the content of the difference fragment differs from the text unit in the original document, the matching fails.

Step S223, based on the matching result, taking the text unit successfully matched as an original text segment of the difference segment, and determining the position information of the original text segment in the target text.

In this embodiment, when a matching result between a differential segment and each text unit includes a text unit that is successfully matched, the text unit that is successfully matched is used as an original text segment of the differential segment, original text position information of the original text segment in an original document is determined according to position information of the original text segment in a target text, and specific coordinates of the differential segment in the original document are output based on the original text position information.
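Steps S222 and S223 can be sketched as follows, independently of any particular PDF library. The `TextUnit` record (page number plus bounding box), the whitespace normalization, and the use of substring containment as the success criterion are all assumptions made for this illustration; the application only states that a match either succeeds or fails.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TextUnit:
    text: str
    page: int                                # 1-based page number (assumed)
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1) on the page (assumed)


def normalize_whitespace(text: str) -> str:
    """Collapse runs of whitespace so line wrapping does not break a match."""
    return " ".join(text.split())


def find_original_segment(
    difference: str, units: List[TextUnit]
) -> Optional[TextUnit]:
    """Return the first text unit containing the difference segment, else None."""
    needle = normalize_whitespace(difference)
    if not needle:
        return None
    for unit in units:
        if needle in normalize_whitespace(unit.text):
            return unit  # successful match
    return None          # failed match for every unit


if __name__ == "__main__":
    units = [
        TextUnit("1. Definitions", 1, (72.0, 700.0, 300.0, 715.0)),
        TextUnit("The term of this\nagreement is five years.", 2, (72.0, 540.0, 520.0, 570.0)),
    ]
    hit = find_original_segment("agreement is five years", units)
    if hit is not None:
        print(hit.page, hit.bbox)
```

The page number and bounding box of the matched unit then serve as the position information from which the original text position is derived.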

Referring to fig. 3, in the document comparison tracing method according to the third embodiment of the present application, step S30 further includes steps S310 to S340:

Step S310, identifying a separator in the target text, and splitting the target text into independent text fragments by using the separator.

Note that the separator may be a line feed, a punctuation mark (e.g., a comma or a period), or another logical separator used to organize text in the document. The purpose of splitting the target text into separate text segments is to cut the long text into smaller, easier to handle segments so that they can be more accurately compared to the difference segment.

The system can identify and extract text content conforming to a specific mode by adopting regular expression matching, a rule-based method or a sequence labeling technology trained by a machine learning model, and the like. Once the separators are identified, the system will segment the continuous text stream into a plurality of separate text segments, each text segment containing one or more sentences, paragraphs, or entries, based on the separators.

Note that regular expression matching can match strings in text according to a predefined pattern. For example, the system may use regular expressions to match common delimiters such as line breaks, commas, periods, etc., and take them as the basis for splitting text.

In addition, the system may also use natural language processing (Natural Language Processing, NLP) techniques to identify natural separation points in text. For example, syntactic analysis techniques may help the system understand the structure of sentences and identify boundaries between sentences. In this way, the system can identify not only the separation points that are clear by punctuation, but also the separation points that are implied by semantics and context, such as the end and start of a paragraph.

After identifying the separators, the system will split the target text into a series of separate text segments based on the separators. Each text segment contains a complete semantic unit, such as a sentence or a paragraph.
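The regular-expression based splitting described above can be sketched as follows. The separator set, the `base_offset` parameter, and the function names are assumptions of this sketch. Each segment is returned together with its character offsets so that its position in the original document remains recoverable.

```python
import re
from typing import List, Tuple

# Assumed separator set: line feeds plus common ASCII and full-width punctuation.
SEPARATORS = re.compile(r"[\n,.;!?，。；！？]+")

Segment = Tuple[str, int, int]  # (text, start offset, end offset)


def _append_segment(
    out: List[Segment], text: str, start: int, end: int, base: int
) -> None:
    raw = text[start:end]
    stripped = raw.strip()
    if not stripped:
        return  # skip empty pieces between adjacent separators
    lead = len(raw) - len(raw.lstrip())  # whitespace removed at the front
    seg_start = start + lead
    out.append((stripped, base + seg_start, base + seg_start + len(stripped)))


def split_with_offsets(target_text: str, base_offset: int = 0) -> List[Segment]:
    """Split on separators; base_offset is where target_text starts in the document."""
    segments: List[Segment] = []
    cursor = 0
    for match in SEPARATORS.finditer(target_text):
        _append_segment(segments, target_text, cursor, match.start(), base_offset)
        cursor = match.end()
    _append_segment(segments, target_text, cursor, len(target_text), base_offset)
    return segments


if __name__ == "__main__":
    for segment in split_with_offsets("Alpha beta. Gamma,\ndelta"):
        print(segment)
```

A purely punctuation-based pattern also splits inside decimal numbers and abbreviations, which is one reason the syntactic analysis mentioned above may be preferable for some documents.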

Step S320, converting each text segment and each difference segment into character sequences, and comparing the character sequences of the text segments and the character sequences of the difference segments character by character.

In this embodiment, the conversion to a character sequence may include encoding the text fragments and the difference fragments into a unified character set, such as UTF-8, to ensure consistency of the comparison. Character-by-character comparison may be accomplished through a simple loop traversal, or through more efficient algorithms such as dynamic programming or rolling hashing (e.g., the Rabin-Karp algorithm), which ensure that character-level matching operations are performed efficiently, even on large amounts of text data.

Step S330, counting the number of character matches and the number of character mismatches at the same positions of the character sequence of the text segment and the character sequence of the difference segment.

In this embodiment, the system maintains two counters, one for recording the number of characters that match and the other for recording the number of characters that do not match. In the comparison process, the match counter is incremented each time the characters at the same position match; otherwise, the mismatch counter is incremented.

Step S340, calculating the similarity between each text segment and the difference segment based on the number of character matches and the number of character mismatches.

In this embodiment, the similarity calculation may use various means, such as a simple matching rate (the number of matching characters divided by the total number of characters), the reciprocal of the Levenshtein distance (edit distance), or other algorithms, such as Jaccard similarity or cosine similarity.
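Steps S320 to S340 with the simple matching rate can be sketched as follows. The sketch assumes that the extra characters of the longer sequence count as mismatches, a detail the description leaves open; the function names are illustrative.

```python
from typing import Tuple


def positional_match_counts(segment: str, difference: str) -> Tuple[int, int]:
    """Count matches and mismatches at the same positions of the two sequences."""
    matches = sum(1 for a, b in zip(segment, difference) if a == b)
    # Assumption: positions that exist in only one sequence count as mismatches.
    longest = max(len(segment), len(difference))
    return matches, longest - matches


def matching_rate(segment: str, difference: str) -> float:
    """Number of matching characters divided by the total number of positions."""
    matches, mismatches = positional_match_counts(segment, difference)
    total = matches + mismatches
    return matches / total if total else 1.0


if __name__ == "__main__":
    print(positional_match_counts("kitten", "sitting"))
    print(matching_rate("kitten", "sitting"))
```

For "kitten" and "sitting" this gives 4 matches and 3 mismatches. Position-wise comparison is sensitive to shifts: a single inserted character at the start misaligns every later position, which is one reason the edit distance may be preferred when insertions and deletions are expected.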

Referring to fig. 4, in the document comparison tracing method according to the fourth embodiment of the present application, step S40 further includes steps S410 to S420:

Step S410, screening out the text fragments with the similarity reaching or exceeding a first preset threshold value from the target text, and taking the text fragments as first candidate fragments.

In this embodiment, the system traverses the similarity scores of all text segments and compares each with a preset first threshold. The first threshold is a predefined similarity level that determines whether a text segment is sufficiently similar to be considered a potential traceability candidate. The screening may use a simple conditional statement: if the similarity score of a text segment is greater than or equal to the first threshold, the text segment is added to the set of first candidate segments. In addition, the system may also employ a ranking technique, where all text segments are first sorted in descending order of similarity score, and then the N segments with the highest scores are selected as the first candidate segments, where N may be adjusted according to the actual situation.

Step S420, based on the first candidate segment, taking the text segment with the similarity reaching or exceeding a second preset threshold value as a second candidate segment, determining the text segment with the highest similarity in the second candidate segment as the tracing segment, and returning the position coordinates of the tracing segment in the original document.

In this embodiment, the second preset threshold is generally higher than the first threshold, so as to ensure that the second candidate segment screened out has higher reliability. After determining the second candidate segment, the system will compare the similarity scores of these candidate segments to find the candidate segment with the highest score. This highest scoring candidate segment will be considered the traceable segment, as it most likely contains the source of the differential segment.

Finally, the system will return the coordinates of the trace-source segment in the original document, which may be page numbers, line numbers, column numbers, or character offsets, etc.

Further, step S420 may include steps S421 to S423:

Step S421, if the text segment with the similarity reaching or exceeding a second preset threshold value does not exist, vector representations of the target text and the difference segment are respectively obtained.

Step S422, calculating semantic similarity between each text segment in the target text and the difference segment.

Step S423, determining a text segment with the highest semantic similarity with the difference segment as the tracing segment in the target text, and returning the position coordinates of the tracing segment in the original document.

It should be noted that, step S421 is a complementary measure to be taken in a case where the similarity of no text segment reaches or exceeds the second preset threshold.

When no screening result exists under the screening of the second preset threshold value, the original text position information of the difference fragment in the original document and the corresponding original fragment can be determined through semantic similarity matching.

In particular, the system may employ BERT (Bidirectional Encoder Representations from Transformers) model-based semantic similarity matching. BERT is a pre-trained deep learning model, commonly used in the field of natural language processing. The BERT-based semantic similarity matching process typically involves the following steps:

First, a preprocessing operation needs to be performed on the input text fragments and the difference fragment. The text corresponding to the input text fragments and the difference fragment is split into words or sub-words by a tokenizer, where the words or sub-words are the basic units that the BERT model can understand and process. Then, to indicate the beginning and end of the text to the BERT model, special markers, such as [CLS] and [SEP], need to be added at the beginning and end of the text content, respectively, wherein the output vector of the [CLS] marker is typically used as a representation of the entire sentence or paragraph. These markers help the BERT model distinguish the boundaries of sentences. After the tokenization and the addition of special markers are completed, the tokens are converted into indices in the model vocabulary, so that the text is converted into a numerical sequence that can be processed by the BERT model.

The processed text is then input into the BERT model, obtaining a vector representation of each word or subword. The BERT model outputs deep bi-directional contextualized representations of each word or subword.

The vectors output by the BERT model are then used to measure semantic similarity between the text segment and the difference segment. For example, the directional consistency of two vectors is evaluated by calculating the cosine similarity between the two vectors, or the dot product of the two vectors is directly calculated, the larger the dot product result is, the higher the similarity of the two text fragments is, etc.

Finally, in the similarity score calculated in step S422, the highest value of the similarity score is found, and the text segment corresponding to the highest value is identified as the trace-source segment. The position coordinates may be determined by tracking the starting and ending positions of the text segments with highest similarity scores in the original document.
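The two-threshold screening of steps S410 and S420, together with the semantic fallback of steps S421 to S423, can be sketched as follows. This is a sketch under stated assumptions: the `embed` parameter is a hypothetical callable standing in for whatever model produces the vector representation (for example the [CLS] output vector of a BERT model), and the function names are illustrative.

```python
import math
from typing import Callable, List, Optional, Sequence

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def trace_with_fallback(
    segments: List[str],
    scores: List[float],
    difference: str,
    first_threshold: float,
    second_threshold: float,
    embed: Callable[[str], Vector],  # hypothetical embedding function
) -> Optional[int]:
    """Return the index of the trace-source segment, or None if there are no segments."""
    if not segments:
        return None
    # S410: first screening.
    first = [i for i, score in enumerate(scores) if score >= first_threshold]
    # S420: second, stricter screening on the first candidates.
    second = [i for i in first if scores[i] >= second_threshold]
    if second:
        return max(second, key=lambda i: scores[i])
    # S421 to S423: no segment passed, so fall back to semantic similarity
    # over every text segment in the target text.
    diff_vec = embed(difference)
    return max(
        range(len(segments)),
        key=lambda i: cosine(embed(segments[i]), diff_vec),
    )


if __name__ == "__main__":
    # Toy two-dimensional vectors used in place of real model output.
    toy = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0], "d": [1.0, 0.1]}
    print(trace_with_fallback(["a", "b", "c"], [0.2, 0.5, 0.4], "d", 0.3, 0.8, toy.__getitem__))
```

The returned index can then be used to look up the start and end positions of the selected text segment in the original document.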

The application provides a document comparison and tracing device which comprises at least one processor and a memory in communication connection with the at least one processor, wherein the memory stores instructions which can be executed by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can execute the document comparison and tracing method in the first embodiment.

Referring now to FIG. 5, a schematic diagram of a document comparison tracing device suitable for implementing embodiments of the present application is shown. The document comparison tracing device in the embodiment of the application can comprise various hardware and software components for realizing the document comparison tracing method. The document comparison tracing device shown in fig. 5 is only an example, and should not impose any limitation on the functions and scope of use of the embodiment of the present application.

As shown in fig. 5, the document comparison tracing device may include a processing device 1001 (e.g., a central processor, a graphics processor, etc.), which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage device 1003 into a random access memory (RAM) 1004. The RAM 1004 also stores various programs and data required for the operation of the document comparison tracing device. The processing device 1001, the ROM 1002, and the RAM 1004 are connected to each other by a bus 1005. An input/output (I/O) interface 1006 is also connected to the bus 1005. In general, the following devices may be connected to the I/O interface 1006: an input device 1007 including, for example, a touch screen, a touch pad, a keyboard, or the like; an output device 1008 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, or the like; a storage device 1003 including, for example, a magnetic tape, a hard disk, or the like; and a communication device 1009. The communication device 1009 may allow the document comparison tracing device to communicate wirelessly or by wire with other devices to exchange data. While a document comparison tracing device with various devices is shown in the figure, it should be understood that not all of the illustrated devices are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.

In particular, according to embodiments of the present disclosure, the processes described above with reference to flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through a communication device, or installed from the storage device 1003, or installed from the ROM 1002. The above-described functions defined in the method of the disclosed embodiment of the application are performed when the computer program is executed by the processing device 1001.

The document comparison tracing device provided by the application can solve the technical problem of how to improve the accuracy of document comparison tracing by adopting the document comparison tracing method in the above embodiment. Compared with the prior art, the beneficial effects of the document comparison tracing device provided by the application are the same as those of the document comparison tracing method provided by the above embodiment, and other technical features of the document comparison tracing device are the same as those disclosed by the method of the previous embodiment, so that the description is omitted.

It is to be understood that portions of the present disclosure may be implemented in hardware, software, firmware, or a combination thereof. In the description of the above embodiments, particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples.

The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

The present application provides a computer readable storage medium having computer readable program instructions (i.e., a computer program) stored thereon for performing the document comparison tracing method in the above-described embodiments.

The computer readable storage medium provided by the present application may be, for example, a USB flash drive, and may be, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system or device, or any combination of the foregoing. More specific examples of a computer-readable storage medium may include, but are not limited to, an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this embodiment, a computer-readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system or device. Program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to electrical wiring, fiber optic cable, RF (radio frequency), and the like, or any suitable combination of the foregoing.

The computer readable storage medium may be included in the document comparison tracing device, or may exist alone without being assembled into the document comparison tracing device.

The computer readable storage medium carries one or more programs, when the one or more programs are executed by a document comparison tracing device, the document comparison tracing device is caused to respond to a tracing request for a difference fragment, determine the range of the difference fragment in an original document according to the difference fragment and the difference position corresponding to the difference fragment, identify texts in the range as target texts, split the target texts into independent text fragments according to separators, compare the text fragments with the difference fragment, determine the similarity between each text fragment and the difference fragment, take the text fragment with the similarity reaching or exceeding a preset threshold as a candidate fragment, determine the text fragment with the highest similarity as a tracing fragment from the candidate fragments, and return the position coordinates of the tracing fragment in the original document.

Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.

The modules involved in the embodiments of the present application may be implemented in software or in hardware. In some cases, the name of a module does not constitute a limitation on the module itself.

The readable storage medium provided by the application is a computer readable storage medium, and the computer readable storage medium stores computer readable program instructions (namely computer programs) for executing the document comparison tracing method, so that the technical problem of how to improve the accuracy of document comparison tracing can be solved. Compared with the prior art, the beneficial effects of the computer readable storage medium provided by the application are the same as those of the document comparison tracing method provided by the embodiment, and the description is omitted here.

An embodiment of the present application provides a computer program product, including a computer program, where the computer program when executed by a processor implements the steps of the document comparison tracing method described above.

The computer program product provided by the application can solve the technical problem of how to improve the accuracy of document comparison and tracing. Compared with the prior art, the beneficial effects of the computer program product provided by the embodiment of the application are the same as the beneficial effects of the document comparison tracing method provided by the embodiment, and are not repeated here.
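As a rough illustration of what such a computer program might look like, the following minimal sketch follows the steps summarized in the abstract: split the target text at separators, score each text fragment against the difference fragment, keep fragments at or above a preset threshold, and return the position of the highest-scoring one. The function name `trace_fragment`, the separator set, the threshold value of 0.6, the use of Python's `difflib.SequenceMatcher` as the similarity measure, and the use of character offsets as "position coordinates" are all assumptions made for illustration; the application does not specify them.

```python
import re
from difflib import SequenceMatcher


def trace_fragment(target_text, diff_fragment, threshold=0.6,
                   separators=r"[。；;.!?\n]"):
    """Return (start, end, similarity) of the best candidate, or None.

    Hypothetical helper: the separator set, threshold, and similarity
    measure are illustrative assumptions, not taken from the application.
    """
    best = None
    start = 0
    # Separator positions delimit the independent text fragments.
    boundaries = [m.start() for m in re.finditer(separators, target_text)]
    boundaries.append(len(target_text))
    for end in boundaries:
        segment = target_text[start:end]
        stripped = segment.strip()
        if stripped:
            score = SequenceMatcher(None, stripped, diff_fragment).ratio()
            # Candidate fragments must reach or exceed the threshold;
            # the highest-scoring candidate becomes the tracing fragment.
            if score >= threshold and (best is None or score > best[2]):
                lead = len(segment) - len(segment.lstrip())
                offset = start + lead
                best = (offset, offset + len(stripped), score)
        start = end + 1  # skip the single-character separator
    return best


if __name__ == "__main__":
    text = "The fee is 100 yuan. The term is 3 years. Payment is due monthly."
    hit = trace_fragment(text, "The term is 5 years")
    if hit is not None:
        print(text[hit[0]:hit[1]], round(hit[2], 3))
```

Returning character offsets rather than the fragment text keeps the result usable for highlighting in the original document; a real implementation would map these offsets to whatever coordinate system the document viewer uses.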

The foregoing describes only preferred embodiments of the present application and is not intended to limit its patent scope. Any equivalent structure or equivalent process derived from the content of this specification, and any direct or indirect application of it in other related technical fields, likewise falls within the scope of protection of the present application.

It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or system that comprises the element.

From the above description of the embodiments, it will be clear to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or alternatively by hardware alone, although in many cases the former is the preferred implementation.

Claims

1. A document comparison and tracing method, characterized by comprising the following steps:

2. The document comparison and tracing method according to claim 1, wherein the step of, in response to a tracing request for a difference fragment, determining a range of the difference fragment in an original document according to the difference fragment and a difference position corresponding to the difference fragment, and identifying the text within the range as a target text comprises:

3. The document comparison and tracing method according to claim 1, wherein the step of, in response to a tracing request for a difference fragment, determining a range of the difference fragment in an original document according to the difference fragment and a difference position corresponding to the difference fragment, and identifying the text within the range as a target text further comprises:

4. The document comparison and tracing method according to claim 1, wherein after the step of, in response to a tracing request for a difference fragment, determining a range of the difference fragment in an original document according to the difference fragment and a difference position corresponding to the difference fragment, and identifying the text within the range as a target text, the method comprises:

5. The document comparison and tracing method according to claim 4, wherein the step of determining the position information and the original document fragment corresponding to the difference fragment comprises:

6. The document comparison and tracing method according to claim 1, wherein the step of dividing the target text into independent text fragments according to separators, comparing the text fragments with the difference fragment, and determining the similarity between each text fragment and the difference fragment comprises:

7. The document comparison and tracing method according to claim 1, wherein the step of taking the text fragments whose similarity reaches or exceeds a preset threshold as candidate fragments, determining the text fragment with the highest similarity among the candidate fragments as a tracing fragment, and returning the position coordinates of the tracing fragment in the original document comprises:

8. The document comparison and tracing method according to claim 7, wherein the step of, based on the first candidate fragments, taking the text fragments whose similarity reaches or exceeds a second preset threshold as second candidate fragments, determining the text fragment with the highest similarity among the second candidate fragments as the tracing fragment, and returning the position coordinates of the tracing fragment in the original document further comprises:

9. A document comparison and tracing device, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program being configured to implement the steps of the document comparison and tracing method according to any one of claims 1 to 8.

10. A computer storage medium, characterized in that the computer storage medium is a computer-readable storage medium on which a computer program is stored, and the computer program, when executed by a processor, implements the steps of the document comparison and tracing method according to any one of claims 1 to 8.