WO2024056457A1 - A method for detection of modification in a first document and related electronic device - Google Patents

Info

Publication number
WO2024056457A1
WO2024056457A1 PCT/EP2023/074289 EP2023074289W WO2024056457A1 WO 2024056457 A1 WO2024056457 A1 WO 2024056457A1 EP 2023074289 W EP2023074289 W EP 2023074289W WO 2024056457 A1 WO2024056457 A1 WO 2024056457A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
document
token
modification
determining
Prior art date
Application number
PCT/EP2023/074289
Other languages
French (fr)
Inventor
Manikant THATIPALLI
Niranjan SENTHILVASAN
Anshuman PRADHAN
Sunil Kumar CHINNAMGARI
Original Assignee
Maersk A/S
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Maersk A/S
Publication of WO2024056457A1

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/60 Protecting data
    • G06F21/64 Protecting data integrity, e.g. using checksums, certificates or signatures
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/42 Document-oriented image-based pattern recognition based on the type of document

Definitions

  • the present disclosure pertains to the field of electronic document control and management.
  • the present disclosure relates to a method for detection of modification in a first document and a related electronic device.
  • the method comprises obtaining first data indicative of tabular data and/or non-tabular data from the first document.
  • the method comprises obtaining second data indicative of tabular data and/or non-tabular data from a second document.
  • the method comprises determining modification data, e.g., by applying a textual sequence comparison logic to the first data and the second data.
  • the method optionally comprises providing, based on the modification data, an output.
  • an electronic device comprises a memory, a processor, and an interface. The electronic device is configured to perform any of the methods disclosed herein.
  • the disclosed electronic device and method provide a more efficient and robust authentication and processing of documents, as well as an early detection of any modification. Further, the disclosed electronic device and method are applicable to various types of documents (such as text documents and tabular documents in any format, such as PDF, PS, Word, Pages, etc.) to identify and automatically detect a modification caused by a party.
  • the disclosed electronic device and method provide flexibility in the type of files and a solution in multiple domains dealing with contractual documents across an organization.
  • Fig. 1A is a diagram illustrating schematically an example representation of a document where a modification took place.
  • Fig. 1B is a diagram illustrating schematically an example data frame according to this disclosure.
  • Fig. 2 is a diagram illustrating schematically an example process for generating a data frame according to this disclosure.
  • Fig. 3 is a diagram illustrating schematically an example process for detecting a modification type according to this disclosure.
  • Fig. 4 is a diagram illustrating schematically an example representation of a second document (such as an original document) and an example representation of a first document (such as a signed document where a modification took place) according to this disclosure.
  • Figs. 5A-5B are a flow-chart illustrating an exemplary method, performed by an electronic device, for detection of modification in a first document according to this disclosure.
  • Fig. 6 is a block diagram illustrating an exemplary electronic device according to this disclosure.
  • the present disclosure enables the processing and authentication of documents, such as legal documents.
  • a document (e.g., a first document and/or a second document) disclosed herein may be seen as an electronic document, such as a document that can be processed by a computing device (e.g., by the disclosed electronic device).
  • a document (e.g., a first document and/or a second document) disclosed herein may comprise tabular and/or non-tabular data.
  • tabular data can be seen as data (e.g., text which can comprise tokens and/or strings and/or words) arranged in a table (e.g., arranged by columns and rows).
  • non-tabular data may be seen as data (e.g., text which can comprise tokens and/or strings and/or words) arranged in paragraphs.
  • a legal document such as a contract
  • a clause dictates certain conditions under which parties to a contract agree to act during the term of the contract. Hence, liability and/or obligation depends on how each clause is written in a contract. Any modification of a legal document needs to be detected.
  • a tamper detection utility is hereby disclosed to authenticate legal documents (such as a contract).
  • the tamper detection utility can be seen as a software that detects any modification of a legal document with respect to a previous version.
  • the tamper detection utility can authenticate the integrity of a document and let the document proceed to be uploaded on a database upon successful authentication.
  • the present disclosure allows performing a comparison between a plurality of PDF documents containing tables, which can be scanned files or text files, depending on how the document is signed.
  • the present disclosure permits determining which lines of a document (such as a clause of a contract document) have been modified and highlighting the modification. The modification can then be reviewed and sent back to the signing party, if required.
  • the present disclosure provides an automation boost and allows a reduction in processing documents, e.g., when compared to the manual comparison.
  • the present disclosure provides a standardization of extraction procedures which leads to supporting a variety of PDF and docx documents irrespective of their format and template.
  • the present disclosure provides the generalization required around the comparison technique.
  • Documents can be scanned and/or non-standard PDFs.
  • the noise varies across each type of document, giving rise to many different challenges and scenarios for Optical Character Recognition, OCR, to handle during extraction. This can require image processing techniques, such as brightness, skewness, and sharpness corrections, to work accurately.
  • the disclosed technique involves, inter alia, text extraction at a word level and a top-down approach for extraction of the sequence of words.
  • Processing tables in between text adds complexity to the pipeline, as the text needs to be extracted sequentially at a cell level, which in turn demands a very efficient table detection algorithm. Any misalignment of the extraction sequence between the two documents leads to false positives in identifying the tamper cases.
  • the present disclosure provides a technique that handles tabular data.
  • the present disclosure provides a technique for detecting a modification between a first document (such as a returned version of a document, such as a signed version of a document) and a second document (such as an original version of the document).
  • the disclosed technique allows detecting, based on the textual sequence comparison logic, a modification (such as addition and/or alteration and/or deletion) of tabular and/or non-tabular data in the first document with respect to the second document.
  • the present disclosure allows comparing a first document with a second document.
  • the disclosed technique can be particularly important for preventing document modification, such as document tampering (e.g., legal document tampering).
  • the disclosed technique can be seen as a document tampering detection technique.
  • a second document disclosed herein may be seen as an original version of a first document.
  • the first document is the second document with one or more modifications.
  • the first document may comprise modified tabular data and/or modified non-tabular data in relation to the second document.
  • the first document may comprise a signature in relation to the second document.
  • the one or more modifications can, for example, result from one or more of: an addition of new tabular data and/or non-tabular data, an alteration of the tabular data and/or non-tabular data, and a deletion of the tabular data and/or non-tabular data.
  • the first document is identical to the second document. Put differently, in some examples, the first document may be based on the second document when no modification is detected by the electronic device.
  • the first document and/or the second document disclosed herein can be of any format.
  • the format may be one or more of: a PDF format, a docx format, a doc format, and a scanned version of a PDF document, among others.
  • the first document and/or the second document disclosed herein can have any template. Put differently, the first document and/or the second document disclosed herein may not necessarily follow a predetermined structure (e.g., a legal document which comprises clauses).
  • a textual sequence comparison logic may be seen as a technique to identify one or more modifications between a first and a second document.
  • the textual sequence comparison logic may detect differences between two versions of a document, such as between a first document and a second document, where the second document may be an original version of the first document.
  • Fig. 1A shows a diagram 1 illustrating schematically an example representation of a document where a modification took place.
  • the document can be a text-based document, including tabular data and/or non-tabular data.
  • the document can be a PDF file or a scanned version of a PDF document.
  • the document can comprise a signature.
  • Fig. 1A shows a representation 14 of a first document, and a representation 16 of a second document.
  • the second document is an original document, while the first document is a returned version of the second document bearing a signature illustrated by 12.
  • the present disclosure provides a technique to detect any modification between the first document and the second document.
  • the disclosed method is performed by an electronic device for detection of modification in the first document 14.
  • Fig. 1A shows that a modification may have occurred when comparing element 10, which may lead to a potential “No match” indicator for the element 10.
  • first data indicative of tabular data and/or non-tabular data is obtained from the first document (such as first document 14).
  • Second data indicative of tabular data and/or non-tabular data is obtained from a second document (such as second document 16).
  • Modification data is determined, e.g., by applying a textual sequence comparison logic to the first data and the second data.
  • the method optionally comprises providing, based on the modification data, an output. For example, a match parameter per token indicating for each respective token whether the respective token of the first data matches the corresponding token of the second data is determined based on a fuzzy matching between a respective token of the first data and a corresponding token of the second data.
  • the match parameter can be included in the modification data.
  • a data frame is for example generated by concatenating at least two of: the first data, the second data, and the match parameter.
  • a data frame can be seen as a data structure representing the extraction of the data from the unsigned (e.g., original and/or truth) second document and from the signed first document, and corresponding match parameters.
  • the data frame is for example in form of a table and/or a matrix.
  • Fig. 1B shows a diagram illustrating schematically an example data frame 2 according to this disclosure.
  • the data frame 2 comprises first data 14A, second data 16A and the corresponding set 38 of match parameters.
  • the first data 14A indicative of tabular data and/or non-tabular data is obtained from the first document (such as first document 14 of Fig. 1A).
  • the second data 16A indicative of tabular data and/or non-tabular data is obtained from a second document (such as second document 16 of Fig. 1A).
  • Modification data is determined, e.g., by applying a textual sequence comparison logic to the first data 14A and the second data 16A. For example, a match parameter (shown in column 38) per token indicating for each respective token whether the respective token of the first data matches the corresponding token of the second data is determined based on a fuzzy matching between a respective token of the first data and a corresponding token of the second data.
  • the match parameter can be included in the modification data.
  • the first data 14A includes for example page numbers in column 18, corresponding token in column 20, corresponding coordinates of a token of the first document in column 22, a table index (e.g., index) in column 24 showing in which table of the first document the token is present, and/or a table row index in column 26 showing in which row of a corresponding table of the first document the token is present.
  • the second data 16A includes for example page number in column 28, corresponding token in column 30, corresponding coordinates of a token of the second document in column 32, a table index (e.g., index) in column 34 showing in which table of the second document the token is present, and/or a table row index in column 36 showing in which row of a corresponding table of the second document the token is present.
  • the match parameter of column 38 indicates “Match” when a similarity parameter between the respective token of the first data and the corresponding token of the second data is above a threshold.
  • the match parameter of column 38 indicates “No Match” when a similarity parameter between the respective token of the first data and the corresponding token of the second data is not above a threshold.
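  • The data frame and match parameter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the disclosed implementation: the `TokenRow` fields, the `build_data_frame` helper, and the 0.9 threshold are hypothetical names and values, and Python's `difflib.SequenceMatcher` ratio stands in for the similarity parameter, since the disclosure does not name a specific fuzzy matching metric.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple


@dataclass
class TokenRow:
    """One extracted token with the attributes listed for Fig. 1B (hypothetical layout)."""
    page: int
    token: str
    coords: Tuple[float, float, float, float]  # assumed (x0, y0, x1, y1) bounding box
    table_index: Optional[int] = None          # None when the token is non-tabular
    table_row_index: Optional[int] = None


def similarity(first_token: str, second_token: str) -> float:
    """Similarity parameter in [0, 1]; difflib is an assumed stand-in metric."""
    return SequenceMatcher(None, first_token, second_token).ratio()


def build_data_frame(first: List[TokenRow], second: List[TokenRow],
                     threshold: float = 0.9) -> List[Dict]:
    """Concatenate first data, second data and a per-token match parameter."""
    frame = []
    for f, s in zip(first, second):
        score = similarity(f.token, s.token)
        frame.append({
            "first": f,
            "second": s,
            "similarity": score,
            # "Match" only when the similarity is above the threshold.
            "match": "Match" if score > threshold else "No Match",
        })
    return frame


# Second document (original) and first document (returned) as token rows.
second = [TokenRow(1, "Rate", (10, 10, 40, 20)),
          TokenRow(1, "50", (50, 10, 60, 20), table_index=0, table_row_index=1),
          TokenRow(1, "USD", (65, 10, 90, 20), table_index=0, table_row_index=1)]
first = [TokenRow(1, "Rate", (10, 10, 40, 20)),
         TokenRow(1, "10", (50, 10, 60, 20), table_index=0, table_row_index=1),
         TokenRow(1, "USD", (65, 10, 90, 20), table_index=0, table_row_index=1)]

frame = build_data_frame(first, second)
matches = [row["match"] for row in frame]
print(matches)
```

  • With this stand-in metric, two five-character tokens that share four characters (e.g., "hello" and "hallo") score 0.8, so whether such a pair counts as a match depends entirely on the chosen threshold.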
  • Fig. 2 shows a diagram illustrating schematically an example process 500 for generating a data frame according to this disclosure.
  • Fig. 2 shows a process 500 for extracting first data from a first document 502 (e.g., a document with a signature) and/or second data from a second document 508 (e.g., an original version of the first document 502), for detection of modification data (e.g., one or more modifications) in the first document 502 with respect to the second document 508.
  • the process 500 can be carried out by the electronic device disclosed herein.
  • the first document 502 can be an image-based document (e.g., a PDF document and/or a scanned version of a PDF document and/or an image captured by a camera, such as converted into a PDF format).
  • the first document 502 can be a text-based document (e.g., a PDF document and/or a docx document and/or a doc document).
  • the second document 508 can be an image-based document (e.g., a PDF document and/or a scanned version of a PDF document and/or an image captured by a camera, such as converted into a PDF format).
  • the second document 508 can be a text-based document (e.g., a PDF document and/or a docx document and/or a doc document).
  • the second document 508 can be document 16 of Fig. 1A.
  • the process 500 includes a check on whether the first document 502 meets the first criterion in 504.
  • the first criterion 504 is met by the first document 502 when the first document 502 includes a signature.
  • the first document 502 is based on the second document 508 by including a signature (e.g., signature 12 of Fig. 1A).
  • the signature is one of a physical signature (e.g., a wet signature and/or a pen-and-ink signature) and an electronic signature (e.g., a handwritten signature and/or typewritten signature).
  • the first document 502 can be document 14 of Fig. 1A.
  • the first document 502 meets the first criterion 504 when the first document 502 includes a signature, such as a physical signature.
  • the first document 502 meets, in some examples, the first criterion 504 when the first document 502 is an image-based document (e.g., the first document 502 is a scanned version of the second document 508).
  • the first data 512a e.g., first data 14A of Fig. 1B
  • the first document is an image-based document.
  • the first data 512a is obtained from an optical extraction technique, such as an Optical Character Recognition, OCR, technique.
  • the first extraction technique enables extracting coordinates associated with the first data 512a.
  • a data frame 514 comprises the first data 512a.
  • a second extraction technique 510 is applied to the first document 502.
  • the first document 502 does not meet the first criterion 504 when the first document 502 is a text-based document.
  • the first data 510a (e.g., first data 14A of Fig. 1B)
  • second extraction technique 510 is a text mining technique and the first data 510a is obtained from the text mining technique.
  • the second extraction technique enables extracting coordinates associated with the first data 510a.
  • the data frame 514 comprises the first data 510a.
  • the process 500 includes a check on whether the second document 508 is an image-based document.
  • the second data 512b (e.g., second data 16A of Fig. 1B)
  • the second data 512b is obtained from an optical extraction technique.
  • the first extraction technique enables extracting coordinates associated with the second data 512b.
  • the data frame 514 comprises the second data 512b.
  • the process 500 includes a check on whether the second document 508 is a text-based document. For example, upon determining that the second document 508 is a text-based document (e.g., the second document is not an image-based document), the second data 510b (e.g., second data 16A of Fig. 1B) is obtained from the second document 508 by applying a second extraction technique 510 to the second document 508.
  • the second data 510b is obtained from a text mining technique.
  • the second extraction technique enables extracting coordinates associated with the second data 510b.
  • the data frame 514 comprises the second data 510b.
  • the data frame 514 (e.g., data frame 2 of Fig. 1B) is a data structure comprising the first data 512a, 510a, which is extracted from the first document 502 (e.g., as illustrated in 14 of Fig. 1A), and the second data 512b, 510b, which is extracted from the second document 508 (e.g., as illustrated in 16 of Fig. 1A).
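  • The choice between the two extraction techniques in process 500 can be sketched as a simple dispatch. This is an illustrative sketch only: the `Document` fields, `select_technique`, `extract`, and the two stub extractors are hypothetical, and the stubs merely stand in for a real OCR engine and a real text mining library, neither of which is named in the disclosure.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# A token paired with its coordinates, as both techniques are said to provide.
Token = Tuple[str, Tuple[float, float, float, float]]


@dataclass
class Document:
    name: str
    is_image_based: bool
    has_physical_signature: bool = False


Extractor = Callable[[Document], List[Token]]


def select_technique(doc: Document) -> str:
    """Pick the optical technique (512) for image-based or physically signed
    documents, and the text mining technique (510) otherwise."""
    if doc.is_image_based or doc.has_physical_signature:
        return "optical"
    return "text_mining"


def extract(doc: Document, extractors: Dict[str, Extractor]) -> List[Token]:
    """Run whichever extractor the check selects for this document."""
    return extractors[select_technique(doc)](doc)


calls: List[Tuple[str, str]] = []  # records which stub handled which document


def stub_ocr(doc: Document) -> List[Token]:
    # Placeholder for an OCR engine; returns fixed tokens for illustration.
    calls.append(("optical", doc.name))
    return [("Rate", (10.0, 10.0, 40.0, 20.0)), ("10", (50.0, 10.0, 60.0, 20.0))]


def stub_text_mining(doc: Document) -> List[Token]:
    # Placeholder for a text mining library; returns fixed tokens for illustration.
    calls.append(("text_mining", doc.name))
    return [("Rate", (10.0, 10.0, 40.0, 20.0)), ("50", (50.0, 10.0, 60.0, 20.0))]


extractors = {"optical": stub_ocr, "text_mining": stub_text_mining}
first_document = Document("signed_scan.pdf", is_image_based=True, has_physical_signature=True)
second_document = Document("original.pdf", is_image_based=False)

data_frame = {
    "first": extract(first_document, extractors),
    "second": extract(second_document, extractors),
}
print(calls)
```

  • Passing the extractors in as callables keeps the selection logic independent of any particular OCR or text mining tool.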
  • Fig. 3 shows a diagram illustrating schematically an example process 600 for detecting a modification type according to this disclosure.
  • Fig. 3 shows how to detect modification and generate modification data (e.g., one or more modifications) indicative of a modification in a first document (e.g., a document with a signature, such as document 502 of Fig. 2) with respect to a second document (e.g., an original version of a document for signature, such as document 508 of Fig. 2).
  • the process 600 can be carried out by the electronic device disclosed herein.
  • the first document (e.g., first document 14, 502 of Figs. 1A and 2) is based on the second document (e.g., second document 16, 508 of Figs. 1A and 2) with one or more modifications.
  • the first document may comprise modified tabular data and/or modified non-tabular data, which are modified with respect to the second document.
  • first data (e.g., first data 14A of Fig. 1B) is obtained from the first document by applying either a first extraction technique (e.g., first extraction technique 512 of Fig. 2) or a second extraction technique (e.g., second extraction technique 510 of Fig. 2) to the first document.
  • second data (e.g., second data 16A of Fig. 1B) is obtained from the second document by applying either the first extraction technique or the second extraction technique to the second document.
  • a data frame 602 can include the first data and/or the second data.
  • the first document and/or the first data comprises one or more tokens.
  • a token can be seen as one or more of: a word, a number, and a string.
  • the first document comprises one or more tokens.
  • the second document and/or the second data comprises one or more tokens.
  • the electronic device disclosed herein can apply a textual sequence comparison logic 604 to detect the modification data between the first document and the second document.
  • the textual sequence comparison logic can include one or more of: a fuzzy matching logic and a longest common subsequence technique.
  • the fuzzy matching logic can be used to detect if there is a modification between the first document and the second document (as illustrated by MATCH/NO MATCH in Fig. 1B).
  • the longest common subsequence technique can be used to determine the type of modification between the first document and the second document.
  • the fuzzy matching logic can identify a token of the first data which partially (e.g., not fully) matches a token of the second data, such as via a similarity parameter.
  • the similarity parameter indicates the similarity between the respective token of the first data and the corresponding token of the second data, e.g., how similar is a respective token of the first data to the corresponding token of the second data.
  • a respective token of the first data (e.g., a word) comprises five characters (e.g., letters), of which only four are identical to the characters of the corresponding five-character token of the second data (e.g., comprised in the second document, such as the original document), which leads to a similarity parameter (e.g., a similarity level) of 80%.
  • the respective token of the first data is similar to the corresponding token of the second data by, for example, 80%.
  • the respective token of the first data matches the corresponding token of the second data when the similarity parameter is above a threshold (e.g., no tampering of the respective token of the first data, such as illustrated by 606).
  • the respective token of the first data does not match the corresponding token of the second data when the similarity parameter is equal to or below the threshold.
  • a match parameter for each respective token of the first data and for each corresponding token of the second data is added to the data frame 602 (e.g., data frame 514 of Fig. 2).
  • the match parameter indicates for each respective token whether the respective token of the first data matches the corresponding token of the second data.
  • the match column comprises the match parameter (e.g., “Match” or “No Match”) for each token comprised in the first data and the second data (as illustrated in Fig. 1B).
  • the fuzzy matching logic initialises and/or populates the match column of the data frame 602 (e.g., data frame 514 of Fig.
  • the longest common subsequence determines the type of modification by comparing a respective sequence of tokens (e.g., one or more tokens, such as one or more words) of the first data with a corresponding sequence of tokens of the second data.
  • the longest common subsequence is applied to the data frame 602.
  • the modification type is indicative of a type of modification in a part of the first document (e.g., a modification type of a respective token).
  • the longest common subsequence is applied (e.g., by the electronic device) to the data frame when the respective token of the first data does not match the corresponding token of the second data (e.g., an occurrence of a “No Match” in the match column) to determine the type of modification carried out.
  • the respective sequence of tokens of the first data handled by the longest common subsequence technique comprises one or more tokens of the first data (e.g., such respective token of the first data requires a correction).
  • the respective sequence of tokens of the second data comprises one or more tokens of the second data to allow determining the modification type.
  • the longest common subsequence and a correction are iteratively applied (e.g., by the electronic device) to the data frame for each respective token of the first data associated with a match parameter indicating “No Match”.
  • the longest common subsequence and the correction are iteratively applied to the data frame until each respective token of the first data matches each corresponding token of the second data.
  • the respective token of the first data is, for example, compared with the corresponding token of the second data for providing a first longest common subsequence.
  • the respective token of the first data does not match the corresponding token of the second data.
  • a correction is performed (e.g., by the electronic device) in the respective token of the first data (e.g., comprising a modification) for providing a respective corrected token of the first data.
  • the respective corrected token of the first data is compared with the corresponding token of the second data for providing a second longest common subsequence.
  • the first longest common subsequence is compared with the second longest common subsequence.
  • the modification type can be determined based on the one or more corrections carried out in the respective token of the first data when the comparison meets a second criterion 604A, 604B, 604C.
  • the comparison between the first longest common subsequence and the second longest common subsequence provides a similarity parameter, which is compared to a threshold, such as 95%. For example, the comparison meets the second criterion when the similarity parameter between the first longest common subsequence and the second longest common subsequence is above the threshold.
  • the respective corrected token of the first data comprises more tokens (e.g., words) in common with the corresponding token of the second data.
  • a longer second longest common subsequence with respect to a first longest common subsequence can, for example, be indicative of a higher similarity parameter between the respective token of the first data and the corresponding token of the second data.
  • the similarity parameter needs to satisfy the threshold for the comparison to meet the second criterion.
  • the respective token (e.g., a word) of the first data can be corrected by shifting a respective sequence of tokens (e.g., a string, such as comprising one or more words) of the first data upwards 610 or downwards 614 by a length until the respective sequence of tokens provides a match parameter indicative of a match between the respective sequence of tokens of the first data and a corresponding sequence of tokens of the second data.
  • the upward shifting correction 610 indicates that the modification type is a deletion type 608.
  • the downward shifting correction 614 for example indicates that the modification type is an addition type 612.
  • the comparison does not meet the second criterion when the similarity parameter between the first longest common subsequence and the second longest common subsequence is not above the threshold, such as 95%.
  • the modification type is determined by the electronic device as an alteration type 616.
  • the output 618 of the process 600 can include information indicating one or more modifications, and corresponding attributes (such as match parameter, similarity parameter and/or modification type).
  • the output 618 can provide an adjusted data frame including a highlight of the one or more modifications, which may be displayed or provided via a user interface object 620 (e.g., representation 704 of Fig. 4).
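  • The determination of a modification type in process 600 can be sketched as follows. This is a simplified sketch under stated assumptions rather than the disclosed algorithm: the function names, the token window, the maximum shift, and the realignment strategy are hypothetical; only the 95% threshold and the three modification types come from the text above. The sketch realigns the two token sequences at a "No Match" row and scores the realigned sequences with a longest common subsequence; it does not assert which branch corresponds to the upward shift 610 or the downward shift 614 of Fig. 3.

```python
from typing import List, Sequence, Tuple


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence (classic dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def lcs_ratio(a: Sequence[str], b: Sequence[str], window: int) -> float:
    """LCS length over a bounded window, normalised to [0, 1]."""
    w = min(window, len(a), len(b))
    if w == 0:
        return 0.0
    return lcs_length(a[:w], b[:w]) / w


def classify_modification(first: List[str], second: List[str], i: int,
                          max_shift: int = 3, window: int = 5,
                          threshold: float = 0.95) -> Tuple[str, int]:
    """Classify the modification at row i, where first[i] did not match second[i]."""
    for k in range(1, max_shift + 1):
        # Skipping k tokens of the first data realigns it: k tokens were added.
        if lcs_ratio(first[i + k:], second[i:], window) > threshold:
            return "addition", k
        # Skipping k tokens of the second data realigns it: k tokens were deleted.
        if lcs_ratio(first[i:], second[i + k:], window) > threshold:
            return "deletion", k
    # No shift restores the alignment: the token itself was altered.
    return "alteration", 0


second = ["Rate", "per", "container", "is", "50", "USD", "payable", "monthly"]

altered = ["Rate", "per", "container", "is", "10", "USD", "payable", "monthly"]
added = ["Rate", "per", "container", "is", "not", "50", "USD", "payable", "monthly"]
deleted = ["Rate", "per", "container", "50", "USD", "payable", "monthly"]

print(classify_modification(altered, second, 4))  # ('alteration', 0)
print(classify_modification(added, second, 4))    # ('addition', 1)
print(classify_modification(deleted, second, 3))  # ('deletion', 1)
```

  • Tokens are compared here by exact equality for brevity; a fuzzy comparison could be substituted to tolerate OCR noise.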
  • Fig. 4 shows a diagram 700 illustrating schematically an example representation 702 of a second document (such as an original document) and an example representation 704 of a first document (such as a signed document based on the original document where a modification took place) according to this disclosure.
  • the first document 704 shows a signature 708 and a modification illustrated by 710 where a number which was 50 in the second document 702 is now changed to 10. This is a modification that can be presented to a user in a user interface using representation 704.
  • Figs. 5A-5B show a flow-chart of an exemplary method 100, performed by an electronic device, for detection of modification in a first document according to the disclosure.
  • the electronic device is the electronic device disclosed herein, such as electronic device 300 of Fig. 6.
  • the method 100 comprises obtaining S102 first data indicative of tabular data and/or non-tabular data from the first document.
  • the method 100 comprises obtaining S102 first data indicative of at least one of tabular data and non-tabular data from the first document.
  • the method 100 comprises obtaining S102 first data indicative of tabular data and non-tabular data from the first document.
  • the first data comprises tabular data and/or non-tabular data from the first document.
  • the method 100 comprises obtaining S104 second data indicative of tabular data and/or non-tabular data from a second document.
  • the second data comprises tabular data and/or non-tabular data from the second document.
  • the method 100 comprises determining S106 modification data by applying S106A a textual sequence comparison logic to the first data and the second data.
  • the method 100 comprises providing S108, based on the modification data, an output.
  • the first document comprises tabular and/or non-tabular data, e.g., in Fig. 1A.
  • the second document comprises tabular and/or non-tabular data, e.g., in Fig. 1A.
  • tabular data can be data arranged in a table (e.g., arranged in one or more columns and in one or more rows).
  • non-tabular data can be data arranged in paragraphs.
  • non-tabular data can be data that is not in tabular form.
  • non-tabular data can be data which is not included in a table.
  • the second document can be seen as an original version of the first document.
  • the second document is an original document sent to a party for signature while the first document is the second document returned by the party after signature.
  • the first document is the second document returned by the party after signature.
  • the first document is based on the second document by including a signature.
  • the first document can be detected as based on the second document with one or more modifications.
  • the first document is a document for inspection, such as for detecting the one or more modifications with respect to the second document.
  • the first document comprises a signature.
  • including a signature in the first document can be seen as modifying the first document in relation to the second document.
  • the one or more modifications can be indicative of an inclusion of a signature in the first document.
  • including a signature in the first document can be seen as an acceptable modification of the first document in relation to the second document.
  • the signature indicates that a signing party approves and/or accepts one or more clauses comprised in the second document (e.g., original document).
  • the second document is the original document on which the involved parties agreed.
  • the signature is one of a physical signature and an electronic signature.
  • the signature can be a physical signature.
  • a physical signature can be a wet signature and/or a pen-and-ink signature (e.g., a handwritten signature).
  • the first document is the second document with one or more modifications.
  • the modification data can be indicative of the one or more modifications.
  • the first document may comprise modified tabular data and/or modified non-tabular data in relation to the second document.
  • the first document may be identical to the second document when no modification (e.g., alteration) is detected between the first document and the second document (e.g., by an electronic device).
  • the modification data can include a match parameter indicating “Match” for each token.
  • the textual sequence comparison logic can detect the modification data (e.g., the one or more modifications) between the first document and the second document. Put differently, the textual sequence comparison logic can, for example, detect differences between two versions of a document, such as between a first document and a second document.
  • the textual sequence comparison logic can include one or more of: a fuzzy matching logic and a longest common subsequence technique.
  • the output can be provided to an entity (e.g., a company and/or an individual and/or an institution) so that a tampered document can be immediately detected (e.g., a legal document which is deliberately modified and/or a generic document intended to be inspected for detecting one or more modifications).
  • the output comprises information indicating one or more modifications, and corresponding attributes (such as match parameter, similarity parameter and/or modification type).
  • the output can include data indicating which tabular data and/or non-tabular data of the first document (e.g., the document under test) differs from corresponding data of the second document (e.g., the original version, and/or the true document).
  • the first document can be detected as a tampered document.
  • the first document can be a generic document (e.g., of non-legal nature) which may have been modified with respect to an original document.
  • the first document can be a legal document (e.g., the first document) which is deliberately modified with respect to an original document (e.g., the second document).
  • Such deliberate modification can be seen as a forgery of the original document (e.g., the second document), which can render the original agreement void.
  • a legal document may be seen as a document that can have a legal impact, such as a document which expresses a duty, obligation or right.
  • Examples of a legal document may include a contract document, a guideline document, an appendix to a contract document, an engagement letter document, a business terms document, a signature document, or any document which expresses a duty, obligation or right.
  • the first document is not a scanned version of the second document.
  • the second document can have a PDF format and/or a docx format and/or a doc format.
  • the first document comprises an electronic signature.
  • an electronic signature can be an image of a handwritten signature.
  • the first document comprises, for example, the image of the handwritten signature.
  • an electronic signature can be a typewritten signature (e.g., using a keyboard) when an identification of a party signing the first document is performed using biometric data and/or an identification card inserted in a card reader and/or a password prior to signing the first document.
  • the second document is an original document prior to including the signature.
  • the first document is a signed version of the second document, which may include one or more other modifications.
  • the signature indicates that a signing party approves the tabular and/or non-tabular data comprised in the second document (e.g., original document).
  • obtaining S102 the first data comprises determining S102A whether the first document meets a first criterion.
  • determining S102A whether the first document meets the first criterion comprises determining S102AA whether the first document includes a signature. In other words, when the first document includes a signature, the first document meets the first criterion and when the first document does not include a signature, the first document does not meet the first criterion.
  • determining S102A whether the first document meets the first criterion comprises determining S102AB whether the first document is an image-based document. In one or more examples, the first document meets the first criterion when the first document is an image-based document.
  • an image-based document can be a scanned PDF document and/or an image captured by a camera (e.g., converted into a PDF format).
  • the first document does not meet the first criterion when the first document is a text-based document.
  • obtaining S102 the first data comprises, upon the first document meeting the first criterion, applying S102B a first extraction technique to the first document to obtain the first data.
  • the first extraction technique is based on optical extraction.
  • the first data is obtained (e.g., extracted) from the first document by applying the first extraction technique to the first document.
  • the first extraction technique can be seen as an extraction tool for image-based documents (e.g., optical character recognition, OCR).
  • the first extraction technique enables, in some examples, extracting coordinates associated with the first data (e.g., the location of the first data in a table of the first document), as illustrated in 22 of Fig. 1B.
  • the first data is obtained from the first document by applying the first extraction technique to the first document.
  • the first document is an image-based document, and the first data is obtained from an optical extraction technique.
  • obtaining S102 the first data comprises, upon the first document not meeting the first criterion, applying S102C a second extraction technique to the first document to obtain the first data.
  • the second extraction technique is based on text mining.
  • the electronic device obtains (e.g., extracts) the first data from the first document by applying the second extraction technique to the first document.
  • the second extraction technique can be seen as an extraction tool for text-based documents (e.g., a text miner, such as a python library for mining text).
  • the second extraction technique enables extracting coordinates associated with the first data (e.g., the location of the first data in the first document), as illustrated in 22 of Fig. 1B.
  • the first data is obtained (e.g., extracted) from the first document by applying the second extraction technique to the first document.
  • the first document does not meet the first criterion when the first document is a text-based document.
  • the first document is a text-based document with a doc format and/or a docx format and/or a PDF format.
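  • The selection between the two extraction techniques can be sketched as follows. This is a minimal illustration only: the `Document` container and its `is_image_based` flag are hypothetical, and the first criterion is read here as "the document is image-based" (the disclosure also allows a signature-based criterion). No actual OCR or text-mining library is called.

```python
from dataclasses import dataclass


@dataclass
class Document:
    # Hypothetical container; the flag is an assumption for illustration.
    name: str
    is_image_based: bool


def meets_first_criterion(doc: Document) -> bool:
    """First criterion, read here as: the document is image-based."""
    return doc.is_image_based


def select_extraction_technique(doc: Document) -> str:
    """Return the name of the extraction technique that would be applied."""
    if meets_first_criterion(doc):
        return "optical_extraction"  # first technique, e.g. OCR
    return "text_mining"  # second technique, e.g. a text miner
```

A scanned PDF would then be routed to optical extraction, while a text-based docx or PDF would be routed to text mining.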
  • the first data comprises a token, optionally associated with coordinates.
  • a token can be seen as one or more of: a word, a number, and a string.
  • the token can optionally be associated with a coordinate in the document and/or, when part of a table, with a location in the table.
  • the first document comprises one or more tokens.
  • the first extraction technique is, e.g., OCR.
  • the second extraction technique is, e.g., a PDF miner.
  • obtaining S104 the second data comprises determining S104A whether the second document is an image-based document. In one or more example methods, obtaining S104 the second data comprises, upon the second document being an image-based document, applying S104B the first extraction technique to the second document to obtain the second data.
  • the second document can be seen as an original document in relation to the first document. For example, the second document does not comprise a signature.
  • the second data is obtained (e.g., extracted) from the second document by applying the first extraction technique to the second document.
  • an image-based document can be a scanned PDF document and/or an image captured by a camera (e.g., converted into a PDF format).
  • the first extraction technique enables extracting coordinates associated with the second data (e.g., the location of the second data in the second document), as illustrated in 32 of Fig. 1B.
  • the first extraction technique enables, for example, extracting coordinates of each token comprised in the second data.
  • obtaining S104 the second data comprises, upon the second document not being an image-based document, applying S104C the second extraction technique to the second document to obtain the second data.
  • the second document is a text-based document with a doc format and/or a docx format and/or a PDF format.
  • the second extraction technique enables extracting coordinates associated with the second data (e.g., the location of the second data in the second document), as illustrated in 32 of Fig. 1B.
  • the second extraction technique enables, for example, extracting coordinates of each token comprised in the second document and/or the second data.
  • the first extraction technique and the second extraction technique are applied to the first document and second document for obtaining the first data and the second data.
  • the coordinates associated with the first data extracted by either applying the first extraction technique or the second extraction technique to the first document can optionally be used for inspecting (e.g., determining) a similarity level between the first document (e.g., document for inspection) and the second document (e.g., original document).
  • each token comprised in the first data and associated with given coordinates can optionally be compared with a token associated with the corresponding coordinates in the second data.
  • the tokens are not identical for the same coordinates.
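  • The optional coordinate-based comparison can be sketched as below. The representation is an assumption made for illustration: each document is reduced to a dictionary mapping a hypothetical `(page, x, y)` coordinate to a token, and coordinates are assumed to agree exactly between the two documents. Extracted coordinates would in practice likely need a tolerance.

```python
def compare_by_coordinates(first_tokens, second_tokens):
    """Return (coords, first_token, second_token) for every coordinate
    at which the two documents do not hold the same token.

    Both arguments map a (page, x, y) tuple to the token found there;
    a missing coordinate is reported as None.
    """
    differing = []
    for coords in sorted(set(first_tokens) | set(second_tokens)):
        first_token = first_tokens.get(coords)
        second_token = second_tokens.get(coords)
        if first_token != second_token:
            differing.append((coords, first_token, second_token))
    return differing


first = {(1, 10, 20): "Total", (1, 60, 20): "900"}
second = {(1, 10, 20): "Total", (1, 60, 20): "100"}
differences = compare_by_coordinates(first, second)
```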
  • applying S106A the textual sequence comparison logic comprises applying S106AA a fuzzy matching logic.
  • the textual sequence comparison logic can be seen as a technique for identifying the one or more modifications between the first document (e.g., comprising a signature) and the second document (e.g., original document).
  • the textual sequence comparison logic comprises a fuzzy matching logic.
  • applying S106AA the fuzzy matching logic comprises determining S106AAA, based on a fuzzy matching between a respective token of the first data and a corresponding token of the second data, a match parameter per token.
  • the match parameter per token indicates for each respective token whether the respective token of the first data matches the corresponding token of the second data.
  • applying S106AA the fuzzy matching logic comprises including S106AAB the match parameter in the modification data. For example, the fuzzy matching logic quantifies (e.g., determines) how similar a respective token of the first data is to a corresponding token of the second data.
  • the fuzzy matching algorithm can identify a token of the first data which partially (e.g., not fully) matches with a token of the second data, e.g., via the similarity parameter disclosed herein.
  • the fuzzy matching logic can be seen as an approximate matching algorithm.
  • the fuzzy matching algorithm allows comparing the first document with the second document at word level.
  • the match parameter per token can be seen as a binary parameter (e.g., “match” or “no match”). Put differently, the match parameter indicates whether the respective token of the first data matches the corresponding token of the second data.
  • the modification data comprises the match parameter.
  • determining S106AAA, based on the fuzzy matching between the respective token of the first data and the corresponding token of the second data, the match parameter per token comprises determining S106AAAA a similarity parameter between the respective token of the first data and the corresponding token of the second data.
  • identifying whether the respective token of the first data matches the corresponding token of the second data includes inspecting a level of similarity between the respective token of the first data and the corresponding token of the second data.
  • the similarity parameter is indicative of the level of similarity between the respective token of the first data and the corresponding token of the second data.
  • a respective token of the first data (e.g., a word) comprises five characters (e.g., letters), of which only four are identical to the characters of the corresponding token of the second data (e.g., comprised in the second document, such as the original document).
  • the similarity parameter can be seen as, e.g., a similarity level.
  • a respective token of the first data is similar to a corresponding token of the second data by, for example, 80%.
  • determining S106AAA, based on the fuzzy matching between the respective token of the first data and the corresponding token of the second data, the match parameter per token comprises determining S106AAAB the match parameter as indicative of a match between the respective token of the first data and the corresponding token of the second data when the similarity parameter is above a threshold.
  • the respective token of the first data matches the corresponding token of the second data when the similarity parameter is above the threshold.
  • the threshold can be seen as a similarity level above which the respective token of the first data is considered matching the corresponding token of the second data (e.g., the match parameter is a “Match” as illustrated in 38 of Fig. 1B).
  • the threshold can be 90% and/or 95% and/or any other suitable threshold.
  • the threshold is, for example, 95%.
  • a respective token of the first data is similar to a corresponding token of the second data by 98% (e.g., the similarity parameter is 98%).
  • the respective token of the first data matches the corresponding token of the second data as the similarity parameter (e.g., 98%) is above the threshold (e.g., 95%).
  • the threshold imposes a minimum level of similarity to have the respective token of the first data matching the corresponding token of the second data.
  • the respective token of the first data matches, based on the threshold and the similarity parameter, the corresponding token of the second data.
  • determining S106AAA, based on the fuzzy matching between the respective token of the first data and the corresponding token of the second data, the match parameter per token comprises determining S106AAAC the match parameter as indicative of no match between the respective token of the first data and the corresponding token of the second data when the similarity parameter is equal to or below the threshold.
  • the respective token of the first data does not match the corresponding token of the second data when the similarity parameter is equal to or below the threshold.
  • the threshold is, for example, 95%.
  • a respective token of the first data is similar to a corresponding token of the second data by 80% (e.g., the similarity parameter is 80%).
  • the respective token of the first data does not match the corresponding token of the second data as the similarity parameter (e.g., 80%) is below the threshold (e.g., 95%).
  • the match parameter is a “No Match” as illustrated in 38 of Fig. 1B.
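  • The threshold rule above can be sketched as follows. The disclosure does not name a specific similarity measure, so the ratio of `difflib.SequenceMatcher` from the Python standard library is used here as a stand-in; the function names and the default threshold of 0.95 are illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity_parameter(first_token: str, second_token: str) -> float:
    """Similarity level in [0, 1] between two tokens (stand-in measure)."""
    return SequenceMatcher(None, first_token, second_token).ratio()


def match_parameter(first_token: str, second_token: str,
                    threshold: float = 0.95) -> str:
    """'Match' only when the similarity is strictly above the threshold;
    a similarity equal to or below the threshold gives 'No Match'."""
    similarity = similarity_parameter(first_token, second_token)
    return "Match" if similarity > threshold else "No Match"


# Five-character tokens sharing four characters: similarity of 0.8.
level = similarity_parameter("hello", "hallo")
```

With a threshold of 95%, the pair "hello" / "hallo" (80% similar under this measure) yields "No Match", whereas identical tokens yield "Match".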
  • the method 100 comprises generating S105 a data frame by concatenating S105A at least two of: the first data, the second data, and the match parameter.
  • the data frame is illustrated, e.g., as data frame 2 of Fig. 1B.
  • the data frame can be seen as a data structure comprising the first data, which is extracted from the first document (e.g., as illustrated in 14 of Fig. 1A), and the second data, which is extracted from the second document (e.g., as illustrated in 16 of Fig. 1A).
  • the second document is the original version of a document.
  • the first document comprises a signature and is based on the second document.
  • the second document does not comprise a signature.
  • the data frame is a concatenation of the first data, the second data and the match parameter.
  • the first data comprises a first location data indicative of a location of a token in the first document.
  • the first location data is illustrated, e.g., in 14A of Fig. 1B.
  • the second data comprises a second location data indicative of a location of a token in the second document.
  • the second location data is illustrated, e.g., in 16A of Fig. 1B.
  • the second location data is obtained by applying either the first or the second extraction technique to the second document.
  • the first and the second location data comprise one or more of: a page of the first document or the second document, a coordinate of a token, an index of a table (e.g., a table that comprises the tabular data), and a row of the table.
  • the data frame is, for example, in form of a table or matrix for comparing the first document with the second document (e.g., for comparing the first data with the second data).
  • the first data and the second data are extracted in a sequential manner to maintain consistency in the sequence, which allows the boundary of each cell of a table to be detected.
  • applying the fuzzy matching logic to the first document and the second document can be seen as resulting in the generation of the data frame (e.g., as initialising and/or populating a match column).
  • the data frame comprises the match column.
  • the match column comprises the match parameter (e.g., “Match” or “No Match”) for each token comprised in the first data and the second data (as illustrated in Fig. 1B).
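  • A data frame of this kind can be sketched with plain Python structures. The row layout, the column names, and the use of `difflib.SequenceMatcher` for the match column are assumptions for illustration; each input is taken to be a list of `(page, token)` pairs in extraction order, and the shorter list is padded with empty cells.

```python
from difflib import SequenceMatcher
from itertools import zip_longest


def build_data_frame(first_data, second_data, threshold=0.95):
    """Concatenate first data, second data and a match column row by row.

    first_data / second_data: lists of (page, token) in extraction order.
    Returns a list of row dictionaries (a minimal stand-in for a table).
    """
    rows = []
    empty_cell = (None, "")
    for first, second in zip_longest(first_data, second_data,
                                     fillvalue=empty_cell):
        first_page, first_token = first
        second_page, second_token = second
        similarity = SequenceMatcher(None, first_token, second_token).ratio()
        rows.append({
            "first_page": first_page,
            "first_token": first_token,
            "second_page": second_page,
            "second_token": second_token,
            "match": "Match" if similarity > threshold else "No Match",
        })
    return rows


frame = build_data_frame(
    [(1, "Invoice"), (1, "900")],
    [(1, "Invoice"), (1, "100"), (1, "USD")],
)
match_column = [row["match"] for row in frame]
```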
  • applying S106A the textual sequence comparison logic comprises applying S106AB a longest common subsequence technique to the data frame for providing a modification type.
  • the textual sequence comparison logic comprises a longest common subsequence, LCS, technique.
  • LCS determines the longest subsequence that is common to two sequences of tokens (e.g., sequences of words) and appears in the same order in both.
  • the LCS technique is applied to the data frame obtained in S105.
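  • For reference, a longest common subsequence over two token sequences can be computed with the textbook dynamic-programming approach sketched below. The function name and the example sentences are illustrative; the disclosure does not prescribe this particular implementation.

```python
def longest_common_subsequence(first_tokens, second_tokens):
    """Return one longest subsequence of tokens common to both lists,
    preserving order (tokens need not be adjacent)."""
    n, m = len(first_tokens), len(second_tokens)
    # lengths[i][j] = LCS length of first_tokens[i:] and second_tokens[j:]
    lengths = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if first_tokens[i] == second_tokens[j]:
                lengths[i][j] = lengths[i + 1][j + 1] + 1
            else:
                lengths[i][j] = max(lengths[i + 1][j], lengths[i][j + 1])
    # Walk the table from the start to recover the subsequence itself.
    result = []
    i = j = 0
    while i < n and j < m:
        if first_tokens[i] == second_tokens[j]:
            result.append(first_tokens[i])
            i += 1
            j += 1
        elif lengths[i + 1][j] >= lengths[i][j + 1]:
            i += 1
        else:
            j += 1
    return result


common = longest_common_subsequence(
    "the buyer shall pay 900 USD".split(),
    "the buyer shall pay 100 USD".split(),
)
```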
  • applying S106AB the longest common subsequence technique to the data frame for providing the modification type comprises determining S106ABA, based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type.
  • the modification type is indicative of a type of modification between the respective token of the first data and the corresponding token of the second data.
  • applying S106AB the longest common subsequence technique to the data frame for providing the modification type comprises including S106ABB the modification type in the modification data.
  • the modification data can include the modification type of a respective token.
  • the LCS identifies the one or more modifications between the first document (e.g., comprising a signature) and the second document (e.g., original document) by comparing a respective sequence of tokens (e.g., one or more tokens, such as one or more words) of the first document with a corresponding sequence of tokens of the second document.
  • the modification data is indicative of one or more modifications.
  • the modification data comprises the modification type.
  • comparing the respective sequence of tokens of the first document with the corresponding sequence of tokens of the second document allows determining a modification type.
  • the modification type is indicative of a type of modification in a part of the first document.
  • comparing the first document (e.g., comprising a signature) with the second document (e.g., original document) at the level of sequences of tokens can reduce time for detecting the type of modification in a part of the first document.
  • comparing the respective sequence of tokens of the first document with the corresponding sequence of tokens of the second document comprises identifying, based on the match parameter (e.g., obtained by applying the fuzzy matching logic and illustrated in 38 of Fig. 1B), the LCS of the first and the second document by progressing from top to bottom of the data frame for each page (e.g., illustrated as 18, 28 in Fig. 1B) and each table (e.g., illustrated as 24, 34 in Fig. 1B) of the first document and the second document.
  • an index identifies a table.
  • the LCS is applied (e.g., by the electronic device) to the data frame when the respective token of the first data does not match the corresponding token of the second data (e.g., an occurrence of a “No Match”) so as to identify the type of modification carried out.
  • an occurrence of a “No Match” indicates a modification in the first document in relation to the second document.
  • an occurrence of a “No Match” can be seen as a mismatch between the first document in relation to the second document.
  • determining S106ABA, based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises iteratively determining S106ABAA a first longest common subsequence between the respective token of the first data and the corresponding token of the second data.
  • determining S106ABA, based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises iteratively performing S106ABAB a correction in the respective token of the first data.
  • the correction is one or more of: a shift, an alteration, an addition, and a deletion.
  • determining S106ABA, based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises iteratively determining S106ABAC a second longest common subsequence between the respective corrected token of the first data and the corresponding token of the second data, e.g., after performing the correction.
  • determining S106ABA, based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises iteratively comparing S106ABAD the first longest common subsequence with the second longest common subsequence.
  • determining S106ABA, based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises iteratively determining S106ABAE whether the comparison meets a second criterion. For example, the comparison meets the second criterion when the match parameter indicates a “Match” (e.g., in the match column comprised in the data frame) for each respective token of the first data and each corresponding token of the second data.
  • the LCS and a correction are iteratively applied (e.g., by the electronic device) to the data frame for each respective token of the first data associated with a match parameter indicating “No Match” until a match is found for each corresponding token of the second data.
  • the LCS is, for example, iteratively applied (e.g., by the electronic device) to the data frame until the match parameter indicates a “Match” (e.g., in the match column comprised in the data frame) for each respective token of the first data and each corresponding token of the second data.
  • a respective token of the first data is compared with a corresponding token of the second data by applying the LCS for provision of a first longest common subsequence.
  • the first LCS has a length.
  • the first LCS is a number of tokens (e.g., words) comprised in a part of the first data.
  • the respective token of the first data is comprised in such part of the first data.
  • the first LCS is the number of words common to the first data and the second data in a same order.
  • a correction is performed in the respective token of the first data.
  • the respective token (e.g., a word) of the first data can be corrected by shifting upwards or downwards a respective sequence of tokens (e.g., a string, such as comprising one or more words) of the first data by a length until the respective sequence of tokens starts matching a corresponding sequence of tokens of the second data.
  • the respective token of the first data can be corrected by shifting the respective token of the first data in the data frame.
  • the respective token of the first data can, for example, be corrected by changing position of the respective token of the first data in the data frame (e.g., which results in a change in the position of the respective token of the first data in the first document). This may be indicative of a modification that is a deletion or an addition.
  • determining S106ABA, based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises, upon the comparison meeting the second criterion, determining S106ABAF the modification type based on one or more corrections.
  • the corrections can include downward shifting, upward shifting, an alteration, an addition, and a deletion.
  • the modification type is a deletion type.
  • the correction is an addition.
  • the data frame comprises an empty cell associated with the first data (as illustrated in 14 of Fig. 1B).
  • a respective token of the first data has been deleted from the first document in relation to corresponding token of the second data.
  • the first data is upward shifted by a length for provision of a first respective adjusted token of the first data.
  • the respective token of the first data can be corrected by adding the respective deleted token of the first data.
  • the respective deleted token is corrected by adding to the first data based on the corresponding token of the second data so as to identify that the modification is a deletion.
  • the modification type is an addition type.
  • the correction is a deletion.
  • the data frame comprises an empty cell associated with the second data.
  • a respective token of the first data has been added to the first document in relation to the corresponding token of the second data.
  • the respective added token is corrected by deleting it from the first data until the first data matches the second data, so as to identify that the modification is an addition.
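  • The empty cells described above can be illustrated by aligning the two token sequences row by row. The sketch below swaps in `difflib.SequenceMatcher.get_opcodes` from the Python standard library as the alignment step; this is an assumption for illustration, not the shift-based procedure of the disclosure. An empty cell (`None`) on the first-data side corresponds to a deletion, and an empty cell on the second-data side corresponds to an addition.

```python
from difflib import SequenceMatcher
from itertools import zip_longest


def align_with_empty_cells(first_tokens, second_tokens):
    """Return rows (first_cell, second_cell); None marks an empty cell."""
    rows = []
    matcher = SequenceMatcher(None, first_tokens, second_tokens,
                              autojunk=False)
    for _tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Equal blocks pair up one to one; unequal blocks are padded
        # with None on the shorter side.
        rows.extend(zip_longest(first_tokens[i1:i2], second_tokens[j1:j2]))
    return rows


aligned = align_with_empty_cells(
    ["pay", "USD", "now"],   # first document: "100" deleted, "now" added
    ["pay", "100", "USD"],   # second (original) document
)
```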
  • the correction is an alteration.
  • a respective token of the first data has been modified in relation to the corresponding token of the second data.
  • modifying a respective token of the first data can comprise modifying the corresponding token (e.g., from the second document) to a different token or modifying the corresponding token to a similar token (e.g., a token modified beyond a threshold).
  • the threshold for calculating the similarity used for comparing the sequence can, for example, be the same as or different from the threshold used for the fuzzy matching algorithm.
  • the first respective adjusted token of the first data is compared with the corresponding token of the second data by applying the LCS for provision of a second longest common subsequence.
  • the second LCS has a length.
  • the second LCS is a number of tokens (e.g., words) comprised in a part of the first data.
  • the first respective adjusted token of the first data is comprised in such part of the first data.
  • the second LCS is the number of words common to the first adjusted first data and the second data in a same order.
  • the first LCS is compared with the second LCS for identifying the modification type.
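  • One way to read the comparison of the first LCS with the second LCS is sketched below, under stated assumptions: tokens are compared exactly rather than fuzzily, and at each mismatch three candidate corrections are tried, keeping the one whose second LCS length equals the first LCS length. The function names and the greedy order of preference (alteration, then addition, then deletion) are illustrative choices, not taken from the disclosure.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    previous = [0] * (len(b) + 1)
    for x in a:
        current = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def modification_types(first_tokens, second_tokens):
    """Return (type, first_token, second_token) for each modification."""
    found = []
    i = j = 0
    while i < len(first_tokens) and j < len(second_tokens):
        if first_tokens[i] == second_tokens[j]:
            i += 1
            j += 1
            continue
        first_lcs = lcs_length(first_tokens[i:], second_tokens[j:])
        # Second LCS after replacing the token (alteration).
        replaced = lcs_length(first_tokens[i + 1:], second_tokens[j + 1:])
        # Second LCS after deleting the token from the first data.
        dropped = lcs_length(first_tokens[i + 1:], second_tokens[j:])
        if replaced == first_lcs:
            found.append(("alteration", first_tokens[i], second_tokens[j]))
            i += 1
            j += 1
        elif dropped == first_lcs:
            found.append(("addition", first_tokens[i], None))
            i += 1
        else:
            # Only re-adding the missing token preserves the first LCS.
            found.append(("deletion", None, second_tokens[j]))
            j += 1
    found.extend(("addition", token, None) for token in first_tokens[i:])
    found.extend(("deletion", None, token) for token in second_tokens[j:])
    return found
```

For instance, comparing `["pay", "900", "USD"]` with the original `["pay", "100", "USD"]` reports a single alteration of "100" into "900".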
  • the modification type comprises one or more of: a deletion type, an addition type, and an alteration type.
  • the modification type includes a deletion type.
  • the deletion type can be seen as, for example, deleting a part (e.g., a token and/or a sequence of tokens) of the first document (e.g., comprising a signature) in relation to the second document (e.g., the original document).
  • the modification type includes an addition type.
  • the addition type can be seen as, for example, adding new data (e.g., a token and/or a sequence of tokens) to the first document in relation to the second document.
  • the modification type includes an alteration type.
  • the alteration type can be seen as, for example, modifying a part (e.g., a token and/or a sequence of tokens) of the first document in relation to the second document.
  • providing S108 the output comprises displaying S108A, on a display device (e.g. a display device of the electronic device), a user interface object representative of the output.
  • displaying the user interface object includes highlighting part of the first document that was modified, as illustrated in Fig. 4.
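  • Highlighting can be illustrated in plain text, as a stand-in for a graphical user interface object. In the sketch below, the input format (pairs of a first-document token and its match parameter) and the `**` marker are assumptions made for illustration.

```python
def highlight_modified(rows, marker="**"):
    """Join tokens into one line, wrapping every token whose match
    parameter is 'No Match' in the given marker."""
    parts = []
    for token, match in rows:
        if match == "No Match":
            parts.append(f"{marker}{token}{marker}")
        else:
            parts.append(token)
    return " ".join(parts)


line = highlight_modified(
    [("pay", "Match"), ("900", "No Match"), ("USD", "Match")]
)
```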
  • the output is the data frame adjusted based on the modification data.
  • the output includes, for example, the data frame resulting from the correction of a mismatch between a respective token of the first data and a corresponding token of the second data.
  • a mismatch can be seen, for example, as an incorrect shift of a respective token of the first data and a corresponding token of the second data.
  • the respective token of the first data and the corresponding token of the second data are not aligned in the data frame (e.g., the respective token of the first data and the corresponding token of the second data are not placed in a same row).
  • the first document and/or the second document can be pre-processed to reduce noise, e.g., by applying filters to the data frame, e.g., in the fuzzy matching logic.
  • the filters can address cases having only a one-character difference (such as “$” replaced with “S”, or “I” with “I”) and cases where a word is broken into two or more words in one of the documents.
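  • Such noise filters can be sketched as follows. The set of confusable character pairs is a hypothetical example (only the “$” / “S” pair is taken from the text above), and restricting the one-character filter to such pairs is a design assumption, so that a genuine one-character change such as "900" versus "100" is not discarded as noise.

```python
# Hypothetical confusable pairs; only "$"/"S" comes from the description.
CONFUSABLE_PAIRS = {frozenset("$S"), frozenset("Il"), frozenset("0O")}


def is_single_character_noise(first_token, second_token):
    """True when the tokens differ in exactly one position and the two
    differing characters form a known confusable pair."""
    if len(first_token) != len(second_token):
        return False
    differing = [(a, b) for a, b in zip(first_token, second_token) if a != b]
    return len(differing) == 1 and frozenset(differing[0]) in CONFUSABLE_PAIRS


def is_split_word(pieces, whole_token):
    """True when two or more consecutive tokens of one document join
    into a single token of the other document."""
    return len(pieces) > 1 and "".join(pieces) == whole_token
```

Under these assumptions, `is_single_character_noise("$100", "S100")` is true while `is_single_character_noise("900", "100")` is false, and `is_split_word(["ship", "ment"], "shipment")` is true.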
  • Fig. 6 shows a block diagram of an example electronic device 300 according to the disclosure.
  • the electronic device 300 comprises a memory 301, a processor 302, and an interface 303.
  • the electronic device 300 may be configured to perform any of the methods disclosed in Figs. 5A-5B. In other words, the electronic device 300 may be configured for detection of modification in a first document.
  • the electronic device 300 is configured to obtain (e.g., via the interface 303 and/or using the memory 301) first data indicative of tabular data and/or non-tabular data from the first document.
  • the electronic device 300 is configured to obtain (e.g., via the interface 303 and/or using the memory 301) second data indicative of tabular data and/or non-tabular data from a second document.
  • the electronic device 300 is configured to determine (e.g., using the processor 302) modification data by applying a textual sequence comparison logic to the first data and the second data.
  • the electronic device 300 is configured to provide (e.g., using the processor 302 and/or the interface 303), based on the modification data, an output.
  • the electronic device 300 is optionally configured to perform any of the operations disclosed in Figs. 5A-5B (such as any one or more of: S102A, S102AA, S102AB, S102B, S102C, S104A, S104B, S104C, S105, S105A, S106AA, S106AAA, S106AAAA, S106AAAB, S106AAAC, S106AAB, S106AB, S106ABA, S106ABAB, S106ABAC, S106ABAD, S106ABAE, S106ABAF, S106ABB, S108A).
  • the operations of the electronic device 300 may be embodied in the form of executable logic routines (e.g., lines of code, software programs, etc.) that are stored on a non-transitory computer readable medium (e.g., the memory 301) and are executed by the processor 302.
  • the operations of the electronic device 300 may be considered a method that the electronic device 300 is configured to carry out. Also, while the described functions and operations may be implemented in software, such functionality may as well be carried out via dedicated hardware or firmware, or some combination of hardware, firmware and/or software.
  • the memory 301 may be one or more of a buffer, a flash memory, a hard drive, a removable media, a volatile memory, a non-volatile memory, a random access memory (RAM), or other suitable device.
  • the memory 301 may include a non-volatile memory for long term data storage and a volatile memory that functions as system memory for the processor 302.
  • the memory 301 may exchange data with the processor 302 over a data bus. Control lines and an address bus between the memory 301 and the processor 302 also may be present (not shown in Fig. 6).
  • the memory 301 is considered a non-transitory computer readable medium.
  • the memory 301 may be configured to store the first data, the second data, the modification data (e.g., including a match parameter, a modification type), a data frame, and/or the output in a part of the memory.
  • Embodiments of methods and products (electronic device) according to the disclosure are set out in the following items:
  • Item 1 A method, performed by an electronic device, for detection of modification in a first document, the method comprising: obtaining (S102) first data indicative of tabular data and/or non-tabular data from the first document; obtaining (S104) second data indicative of tabular data and/or non-tabular data from a second document; determining (S106) modification data by applying (S106A) a textual sequence comparison logic to the first data and the second data; and providing (S108), based on the modification data, an output.
  • Item 2 The method according to item 1, wherein the first document is based on the second document by including a signature.
  • Item 3 The method according to item 2, wherein the signature is one of a physical signature and an electronic signature.
  • Item 4 The method according to any of items 2-3, wherein the second document is an original document prior to including the signature.
  • Item 5 The method according to any of the previous items, wherein obtaining (S102) the first data comprises: determining (S102A) whether the first document meets a first criterion; upon the first document meeting the first criterion, applying (S102B) a first extraction technique to the first document to obtain the first data, wherein the first extraction technique is based on optical extraction; and upon the first document not meeting the first criterion, applying (S102C) a second extraction technique to the first document to obtain the first data, wherein the second extraction technique is based on text mining.
  • determining (S102A) whether the first document meets the first criterion comprises determining (S102AA) whether the first document includes a physical signature.
  • determining (S102A) whether the first document meets the first criterion comprises determining (S102AB) whether the first document is an image-based document.
  • Item 8 The method according to any of the previous items, wherein the first document does not meet the first criterion when the first document is a text-based document.
  • Item 9 The method according to any of the previous items, wherein the first data comprises a token.
  • Item 10 The method according to any of the previous items, wherein obtaining (S104) the second data comprises: determining (S104A), whether the second document is an image-based document; upon the second document being an image-based document, applying (S104B) the first extraction technique to the second document to obtain the second data; and upon the second document not being an image-based document, applying (S104C) the second extraction technique to the second document to obtain the second data.
  • Item 12 The method according to item 11, wherein applying (S106AA) the fuzzy matching logic comprises: determining (S106AAA), based on a fuzzy matching between a respective token of the first data and a corresponding token of the second data, a match parameter per token indicating for each respective token whether the respective token of the first data matches the corresponding token of the second data; and including (S106AAB) the match parameter in the modification data.
  • determining (S106AAA), based on the fuzzy matching between the respective token of the first data and the corresponding token of the second data, the match parameter per token comprises: determining (S106AAAA) a similarity parameter between the respective token of the first data and the corresponding token of the second data; and determining (S106AAAB) the match parameter as indicative of a match between the respective token of the first data and the corresponding token of the second data when the similarity parameter is above a threshold.
  • Item 14 The method according to any of items 12-13, wherein determining (S106AAA), based on the fuzzy matching between the respective token of the first data and the corresponding token of the second data, the match parameter per token comprises: determining (S106AAAC) the match parameter as indicative of no match between the respective token of the first data and the corresponding token of the second data when the similarity parameter is equal or below the threshold.
  • Item 15 The method according to any of items 12-14, the method comprising generating (S105) a data frame by concatenating (S105A) at least two of: the first data, the second data, and the match parameter.
  • Item 16 The method according to item 15, wherein applying (S106A) the textual sequence comparison logic comprises applying (S106AB) a longest common subsequence technique to the data frame for providing a modification type.
  • Item 17 The method according to item 16, wherein applying (S106AB) the longest common subsequence technique to the data frame for providing the modification type comprises: determining (S106ABA), based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type indicative of a type of modification between the respective token of the first data and the corresponding token of the second data; and including (S106ABB) the modification type in the modification data.
  • determining (S106ABA), based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises iteratively: determining (S106ABAA) a first longest common subsequence between the respective token of the first data and the corresponding token of the second data; performing (S106ABAB) a correction in the respective token of the first data; wherein the correction is one or more of: a shift, an alteration, an addition, and a deletion.
  • Item 19 The method according to item 18, wherein determining (S106ABA), based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises: upon the comparison meeting the second criterion, determining (S106ABAF) the modification type based on one or more corrections.
  • Item 20 The method according to any of items 16-19, wherein the modification type comprises one or more of: a deletion type, an addition type, and an alteration type.
  • Item 21 The method according to any of the previous items, wherein providing (S108) the output comprises displaying (S108A), on a display device, a user interface object representative of the output.
  • Item 22 An electronic device comprising a memory, a processor, and an interface, wherein the electronic device is configured to perform any of the methods according to items 1-21.
  • Item 23 A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by an electronic device cause the electronic device to perform any of the methods of items 1-21.
  • the use of the terms “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. does not imply any particular order; the terms are included to identify individual elements.
  • the use of the terms “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. does not denote any order or importance, but rather the terms “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. are used to distinguish one element from another.
  • the words “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. are used here and elsewhere for labelling purposes only and are not intended to denote any specific spatial or temporal ordering.
  • the labelling of a first element does not imply the presence of a second element and vice versa.
  • Figs. 1A-6 comprise some circuitries or operations which are illustrated with a solid line and some circuitries or operations which are illustrated with a dashed line.
  • the circuitries or operations which are comprised in a solid line are circuitries or operations which are comprised in the broadest example embodiment.
  • the circuitries or operations which are comprised in a dashed line are example embodiments which may be comprised in, or a part of, or are further circuitries or operations which may be taken in addition to the circuitries or operations of the solid line example embodiments. It should be appreciated that these operations need not be performed in the order presented. Furthermore, it should be appreciated that not all of the operations need to be performed.
  • the exemplary operations may be performed in any order and in any combination.
  • a computer-readable medium may include removable and non-removable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVD), etc.
  • program circuitries may include routines, programs, objects, components, data structures, etc. that perform specified tasks or implement specific abstract data types.
  • Computer-executable instructions, associated data structures, and program circuitries represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Security & Cryptography (AREA)
  • Artificial Intelligence (AREA)
  • Bioethics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Computer Hardware Design (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Character Input (AREA)

Abstract

Disclosed is a method, performed by an electronic device, for detection of modification in a first document. The method includes obtaining first data indicative of tabular data and/or non-tabular data from the first document. The method includes obtaining second data indicative of tabular data and/or non-tabular data from a second document. The method includes determining modification data by applying a textual sequence comparison logic to the first data and the second data. The method includes providing, based on the modification data, an output.

Description

A METHOD FOR DETECTION OF MODIFICATION IN A FIRST DOCUMENT AND RELATED ELECTRONIC DEVICE
The present disclosure pertains to the field of electronic document control and management. The present disclosure relates to a method for detection of modification in a first document and a related electronic device.
BACKGROUND
Companies handle a huge number of clients on a long-term or short-term basis, and generally create legal documents, e.g., contracts formalizing agreements with clients. Contracts have a digital signature provision for a party to sign. However, in some countries where a physical signature is the only way to sign a contract, one party signs the contract manually and sends it back to the other party. These contracts might get tampered with by one of the parties during signing, such as by changing, deleting, or adding any part of the contracts, such as a part of one or more clauses. This is detrimental to the handling of the contracts.
SUMMARY
It is time-consuming and error-prone to manually check legal documents for tampering or modification during the signature and/or execution process.
Accordingly, there is a need for an electronic device and a method for detection of modification in a first document, which mitigate, alleviate, or address the existing shortcomings and provide a more efficient and robust authentication and processing of documents, as well as an early detection of any modification.
Disclosed is a method, performed by an electronic device, for detection of modification in a first document. The method comprises obtaining first data indicative of tabular data and/or non-tabular data from the first document. The method comprises obtaining second data indicative of tabular data and/or non-tabular data from a second document. The method comprises determining modification data, e.g., by applying a textual sequence comparison logic to the first data and the second data. The method optionally comprises providing, based on the modification data, an output. Disclosed is an electronic device. The electronic device comprises a memory, a processor, and an interface. The electronic device is configured to perform any of the methods disclosed herein.
Disclosed is a computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by an electronic device cause the electronic device to perform any of the methods disclosed herein.
It is an advantage of the present disclosure that the disclosed electronic device and method provide a more efficient and robust authentication and processing of documents, as well as an early detection of any modification. Further, the disclosed electronic device and method are applicable to various types of documents (such as text documents, Tabular documents in any format, such as PDF, PS, Word, Pages etc.) to identify and automatically detect a modification caused by a party. The disclosed electronic device and method provide flexibility in the type of files and a solution in multiple domains dealing with contractual documents across an organization.
BRIEF DESCRIPTION OF THE DRAWINGS
The above and other features and advantages of the present disclosure will become readily apparent to those skilled in the art by the following detailed description of exemplary embodiments thereof with reference to the attached drawings, in which:
Fig. 1A is a diagram illustrating schematically an example representation of a document where a modification took place,
Fig. 1B is a diagram illustrating schematically an example data frame according to this disclosure,
Fig. 2 is a diagram illustrating schematically an example process for generating a data frame according to this disclosure,
Fig. 3 is a diagram illustrating schematically an example process for detecting a modification type according to this disclosure,
Fig. 4 is a diagram illustrating schematically an example representation of a second document (such as an original document) and an example representation of a first document (such as a signed document where a modification took place) according to this disclosure,
Figs. 5A-5B are a flow-chart illustrating an exemplary method, performed by an electronic device, for detection of modification in a first document according to this disclosure, and
Fig. 6 is a block diagram illustrating an exemplary electronic device according to this disclosure.
DETAILED DESCRIPTION
Various exemplary embodiments and details are described hereinafter, with reference to the figures when relevant. It should be noted that the figures may or may not be drawn to scale and that elements of similar structures or functions are represented by like reference numerals throughout the figures. It should also be noted that the figures are only intended to facilitate the description of the embodiments. They are not intended as an exhaustive description of the disclosure or as a limitation on the scope of the disclosure. In addition, an illustrated embodiment need not have all the aspects or advantages shown. An aspect or an advantage described in conjunction with a particular embodiment is not necessarily limited to that embodiment and can be practiced in any other embodiments even if not so illustrated, or if not so explicitly described.
The figures are schematic and simplified for clarity, and they merely show details which aid understanding the disclosure, while other details have been left out. Throughout, the same reference numerals are used for identical or corresponding parts.
The present disclosure enables the processing and authentication of documents, such as legal documents.
A document (e.g., a first document and/or a second document) disclosed herein may be seen as an electronic document, such as a document that can be processed by a computing device (e.g., by the disclosed electronic device). A document (e.g., a first document and/or a second document) disclosed herein may comprise tabular and/or non-tabular data. For example, tabular data can be seen as data (e.g., text which can comprise tokens and/or strings and/or words) arranged in a table (e.g., arranged by columns and rows). For example, non-tabular data may be seen as data (e.g., text which can comprise tokens and/or strings and/or words) arranged in paragraphs. A legal document, such as a contract, can include one or more clauses. A clause dictates certain conditions under which parties to a contract agree to act during the term of the contract. Hence, liability and/or obligation depends on how each clause is written in a contract. Any modification of a legal document needs to be detected. A tamper detection utility is hereby disclosed to authenticate legal documents (such as a contract). The tamper detection utility can be seen as software that detects any modification of a legal document with respect to a previous version. The tamper detection utility can authenticate the integrity of a document and let the document proceed to be uploaded to a database upon successful authentication.
The present disclosure allows performing a comparison between a plurality of PDF documents containing tables, which can be scanned files or text files, depending on how the document is signed. The present disclosure permits determining which lines of a document (such as a clause of a contract document) have been modified and highlighting the modification. The modification can then be reviewed and sent back to the signing party, if required. The present disclosure provides an automation boost and allows a reduction in document processing effort, e.g., when compared to manual comparison.
The present disclosure provides a standardization of extraction procedures which leads to supporting a variety of PDF and docx documents irrespective of their format and template. The present disclosure provides the generalization required of the comparison technique. Documents can be scanned and/or non-standard PDFs. The noise varies across document types, giving rise to many different challenges and scenarios for Optical Character Recognition, OCR, to handle during extraction. This can require accurately working image processing techniques, such as brightness, skewness, and sharpness corrections.
The disclosed technique involves, inter alia, text extraction at a word level and a top-down approach for extraction of a sequence of words. Processing tables located in between text adds some complexity to the pipeline, as the text needs to be extracted at a cell level sequentially, which in turn demands a very efficient table detection algorithm. Any misalignment of the extraction sequence between the two documents leads to false positives in identifying the tamper cases. The present disclosure provides a technique that handles tabular data. The present disclosure provides a technique for detecting a modification between a first document (such as a returned version of a document, such as a signed version of a document) and a second document (such as an original version of the document). The disclosed technique allows detecting, based on the textual sequence comparison logic, a modification (such as addition and/or alteration and/or deletion) of tabular and/or non-tabular data in the first document with respect to the second document. In other words, the present disclosure allows comparing a first document with a second document. For example, the disclosed technique can be particularly important for preventing document modification, such as document tampering (e.g., legal document tampering). In some examples, the disclosed technique can be seen as a document tampering detection technique.
A second document disclosed herein may be seen as an original version of a first document. For example, the first document is the second document with one or more modifications. Put differently, the first document may comprise modified tabular data and/or modified non-tabular data in relation to the second document. In some examples, the first document may comprise a signature in relation to the second document. The one or more modifications can, for example, result from one or more of: an addition of new tabular data and/or non-tabular data, an alteration of the tabular data and/or non-tabular data, and a deletion of the tabular data and/or non-tabular data. In some examples, the first document is identical to the second document. Put differently, in some examples, the first document may be based on the second document when no modification is detected by the electronic device.
The first document and/or the second document disclosed herein can be of any format. The format may be one or more of: a PDF format, a docx format, a doc format, and a scanned version of a PDF document, among others. The first document and/or the second document disclosed herein can have any template. Put differently, the first document and/or the second document disclosed herein may not necessarily follow a predetermined structure (e.g., a legal document which comprises clauses).
A textual sequence comparison logic may be seen as a technique to identify one or more modifications between a first and a second document. Put differently, the textual sequence comparison logic may detect differences between two versions of a document, such as between a first document and a second document, where the second document may be an original version of the first document.
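For illustration only, one possible form of such a comparison logic can be sketched with a longest common subsequence computed over two token lists. The sketch below is an assumption-laden simplification and not the disclosed implementation: the function names `lcs_table` and `classify_modifications`, the exact-equality comparison of tokens, and the tuple layout of the result are all hypothetical.

```python
from typing import List, Optional, Tuple

Change = Tuple[str, Optional[str], Optional[str]]


def lcs_table(a: List[str], b: List[str]) -> List[List[int]]:
    """Build a table where t[i][j] is the LCS length of a[i:] and b[j:]."""
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) - 1, -1, -1):
        for j in range(len(b) - 1, -1, -1):
            if a[i] == b[j]:
                t[i][j] = t[i + 1][j + 1] + 1
            else:
                t[i][j] = max(t[i + 1][j], t[i][j + 1])
    return t


def classify_modifications(original: List[str], returned: List[str]) -> List[Change]:
    """Label tokens outside the common subsequence (hypothetical helper).

    Each change is (type, original_token, returned_token); a missing side is None.
    """
    t = lcs_table(original, returned)
    i = j = 0
    changes: List[Change] = []
    while i < len(original) and j < len(returned):
        if original[i] == returned[j]:
            i, j = i + 1, j + 1  # token belongs to the common subsequence
        elif t[i + 1][j + 1] == t[i][j]:
            # Skipping both tokens keeps the LCS length: treat as a replacement.
            changes.append(("alteration", original[i], returned[j]))
            i, j = i + 1, j + 1
        elif t[i + 1][j] >= t[i][j + 1]:
            changes.append(("deletion", original[i], None))
            i += 1
        else:
            changes.append(("addition", None, returned[j]))
            j += 1
    # Whatever remains on either side after the walk is a trailing change.
    changes += [("deletion", tok, None) for tok in original[i:]]
    changes += [("addition", None, tok) for tok in returned[j:]]
    return changes
```

In this sketch, `original` plays the role of the tokens of the second document and `returned` the tokens of the first document. Exact equality is used only to keep the example short; a fuzzy comparison could be substituted at the point where tokens are compared.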
Fig. 1A shows a diagram 1 illustrating schematically an example representation of a document where a modification took place. The document can be a text-based document, including tabular data and/or non-tabular data. The document can be a PDF file or a scanned version of a PDF document. Optionally, the document can comprise a signature. Fig. 1A shows a representation 14 of a first document, and a representation 16 of a second document. The second document is an original document, while the first document is a returned version of the second document bearing a signature illustrated by 12. The present disclosure provides a technique to detect any modification between the first document and the second document. The disclosed method is performed by an electronic device for detection of modification in the first document 14. Fig. 1A shows that a modification may have occurred when comparing element 10, which may lead to a potential “No match” indicator for the element 10.
In the disclosed method, first data indicative of tabular data and/or non-tabular data is obtained from the first document (such as first document 14). Second data indicative of tabular data and/or non-tabular data is obtained from a second document (such as second document 16). Modification data is determined, e.g., by applying a textual sequence comparison logic to the first data and the second data. The method optionally comprises providing, based on the modification data, an output. For example, a match parameter per token indicating for each respective token whether the respective token of the first data matches the corresponding token of the second data is determined based on a fuzzy matching between a respective token of the first data and a corresponding token of the second data. The match parameter can be included in the modification data.
A data frame is for example generated by concatenating at least two of: the first data, the second data, and the match parameter. A data frame can be seen as a data structure representing the extraction of the data from the unsigned (e.g., original and/or truth) second document and from the signed first document, and corresponding match parameters. The data frame is for example in form of a table and/or a matrix.
Fig. 1B shows a diagram illustrating schematically an example data frame 2 according to this disclosure. The data frame 2 comprises first data 14A, second data 16A and the corresponding set 38 of match parameters.
The first data 14A indicative of tabular data and/or non-tabular data is obtained from the first document (such as first document 14 of Fig. 1A). The second data 16A indicative of tabular data and/or non-tabular data is obtained from a second document (such as second document 16 of Fig. 1A).
Modification data is determined, e.g., by applying a textual sequence comparison logic to the first data 14A and the second data 16A. For example, a match parameter (shown in column 38) per token indicating for each respective token whether the respective token of the first data matches the corresponding token of the second data is determined based on a fuzzy matching between a respective token of the first data and a corresponding token of the second data. The match parameter can be included in the modification data.
The first data 14A includes for example page numbers in column 18, corresponding token in column 20, corresponding coordinates of a token of the first document in column 22, a table index (e.g., index) in column 24 showing in which table of the first document the token is present, and/or a table row index in column 26 showing in which row of a corresponding table of the first document the token is present.
The second data 16A includes for example page number in column 28, corresponding token in column 30, corresponding coordinates of a token of the second document in column 32, a table index (e.g., index) in column 34 showing in which table of the second document the token is present, and/or a table row index in column 36 showing in which row of a corresponding table of the second document the token is present.
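As a rough illustration of the columns described above, the per-token entries and their side-by-side arrangement can be sketched as follows. The class name `TokenRecord`, the field names, the bounding-box layout of the coordinates, and the padding of the shorter side with `None` are assumptions made for this example and are not taken from the disclosure.

```python
from dataclasses import dataclass
from itertools import zip_longest
from typing import Dict, List, Optional, Tuple


@dataclass
class TokenRecord:
    """One extracted token with its location (hypothetical structure)."""
    page: int
    token: str
    coordinates: Tuple[float, float, float, float]  # assumed box (x0, y0, x1, y1)
    table_index: Optional[int] = None      # None when the token is outside any table
    table_row_index: Optional[int] = None  # row within that table, if any


def build_data_frame(
    first: List[TokenRecord], second: List[TokenRecord]
) -> List[Dict[str, Optional[TokenRecord]]]:
    """Place the i-th record of each document in the same row.

    The shorter side is padded with None so every row has both keys.
    """
    return [{"first": f, "second": s} for f, s in zip_longest(first, second)]
```

A match parameter per row could be stored under a third key of each row once it has been determined.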
The match parameter of column 38 indicates “Match” when a similarity parameter between the respective token of the first data and the corresponding token of the second data is above a threshold. The match parameter of column 38 indicates “No Match” when a similarity parameter between the respective token of the first data and the corresponding token of the second data is not above a threshold.
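The thresholding described above can be sketched, under stated assumptions, with the standard-library `difflib.SequenceMatcher`. The disclosure does not name a similarity measure or a threshold value, so the use of `ratio()` and the default threshold of 0.8 below are purely illustrative, as is the function name `match_parameter`.

```python
from difflib import SequenceMatcher


def match_parameter(first_token: str, second_token: str, threshold: float = 0.8) -> str:
    """Return "Match" when the similarity is strictly above the threshold.

    The similarity measure (SequenceMatcher.ratio, in the range 0.0 to 1.0)
    and the default threshold are assumptions made for this sketch.
    """
    similarity = SequenceMatcher(None, first_token, second_token).ratio()
    return "Match" if similarity > threshold else "No Match"
```

Because the comparison is strict, a similarity equal to the threshold yields “No Match”, in line with the match parameter indicating “No Match” when the similarity parameter is not above the threshold.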
Fig. 2 shows a diagram illustrating schematically an example process 500 for generating a data frame according to this disclosure. Fig. 2 shows a process 500 for extracting first data from a first document 502 (e.g., a document with a signature) and/or second data from a second document 508 (e.g., an original version of the first document 502), for detection of modification data (e.g., one or more modifications) in the first document 502 with respect to the second document 508. The process 500 can be carried out by the electronic device disclosed herein.
For example, the first document 502 can be an image-based document (e.g., a PDF document and/or a scanned version of a PDF document and/or an image captured by a camera, such as converted into a PDF format). For example, the first document 502 can be a text-based document (e.g., a PDF document and/or a docx document and/or a doc document).
For example, the second document 508 can be an image-based document (e.g., a PDF document and/or a scanned version of a PDF document and/or an image captured by a camera, such as converted into a PDF format). For example, the second document 508 can be a text-based document (e.g., a PDF document and/or a docx document and/or a doc document). For example, the second document 508 can be document 16 of Fig. 1A.
The process 500 includes a check on whether the first document 502 meets the first criterion in 504. The first criterion 504 is met by the first document 502 when the first document 502 includes a signature. For example, the first document 502 is based on the second document 508 by including a signature (e.g., signature 12 of Fig. 1A). For example, the signature is one of a physical signature (e.g., a wet signature and/or a pen-and-ink signature) and an electronic signature (e.g., a handwritten signature and/or typewritten signature). For example, the first document 502 can be document 14 of Fig. 1A.
For example, the first document 502 meets the first criterion 504 when the first document 502 includes a signature, such as a physical signature. Put differently, the first document 502 meets, in some examples, the first criterion 504 when the first document 502 is an image-based document (e.g., the first document 502 is a scanned version of the second document 508). For example, upon determining that the first document 502 includes the physical signature, the first data 512a (e.g., first data 14A of Fig. 1B) is obtained from the first document 502 by applying a first extraction technique 512 to the first document 502. For example, the first document is an image-based document. For example, the first data 512a is obtained from an optical extraction technique, such as an Optical Character Recognition, OCR, technique. In some examples, the first extraction technique enables extracting coordinates associated with the first data 512a. For example, a data frame 514 comprises the first data 512a.
When the first document 502 does not meet the first criterion (No branch), a second extraction technique 510 is applied to the first document 502. For example, the first document 502 does not meet the first criterion 504 when the first document 502 is a text-based document. For example, upon determining that the first document 502 is a text-based document, the first data 510a (e.g., first data 14A of Fig. 1B) is obtained from the first document 502 by applying a second extraction technique 510 to the first document 502. For example, the second extraction technique 510 is a text mining technique and the first data 510a is obtained from the text mining technique. In some examples, the second extraction technique enables extracting coordinates associated with the first data 510a. For example, the data frame 514 comprises the first data 510a.
The process 500 includes a check on whether the second document 508 is an image-based document. For example, upon determining that the second document 508 is an image-based document, the second data 512b (e.g., second data 16A of Fig. 1B) is obtained from the second document 508 by applying a first extraction technique 512 to the second document 508. For example, the second data 512b is obtained from an optical extraction technique. In some examples, the first extraction technique enables extracting coordinates associated with the second data 512b. For example, the data frame 514 comprises the second data 512b.
The process 500 includes a check on whether the second document 508 is a text-based document. For example, upon determining that the second document 508 is a text-based document (e.g., the second document is not an image-based document), the second data 510b (e.g., second data 16A of Fig. 1B) is obtained from the second document 508 by applying a second extraction technique 510 to the second document 508. For example, the second data 510b is obtained from a text mining technique. In some examples, the second extraction technique enables extracting coordinates associated with the second data 510b. For example, the data frame 514 comprises the second data 510b.
For example, the data frame 514 (e.g., data frame 2 of Fig. 1B) is a data structure comprising the first data 512a, 510a, which is extracted from the first document 502 (e.g., as illustrated in 14 of Fig. 1A), and the second data 512b, 510b, which is extracted from the second document 508 (e.g., as illustrated in 16 of Fig. 1A).
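The branching of process 500 can be illustrated with a minimal Python sketch. The names `Document`, `meets_first_criterion`, `select_extraction_for_first` and `select_extraction_for_second` are hypothetical and do not appear in the disclosure. The sketch assumes one possible reading of the first criterion 504 (a signature and/or an image-based document) and only returns a label for the selected technique rather than performing OCR or text mining.

```python
from dataclasses import dataclass


@dataclass
class Document:
    """Hypothetical stand-in for a document handled by process 500."""
    name: str
    is_image_based: bool          # e.g., a scanned PDF or a camera image
    has_signature: bool = False   # e.g., a wet or an electronic signature


def meets_first_criterion(doc: Document) -> bool:
    # Assumption: the criterion is met by a signature and/or an image-based document.
    return doc.has_signature or doc.is_image_based


def select_extraction_for_first(doc: Document) -> str:
    # Yes branch: first (optical) technique; No branch: second (text mining) technique.
    return "optical" if meets_first_criterion(doc) else "text_mining"


def select_extraction_for_second(doc: Document) -> str:
    # The second document is only checked for being image-based or text-based.
    return "optical" if doc.is_image_based else "text_mining"


signed_scan = Document("first", is_image_based=True, has_signature=True)
original = Document("second", is_image_based=False)
print(select_extraction_for_first(signed_scan))   # optical
print(select_extraction_for_second(original))     # text_mining
```

In such a sketch, the data extracted by either branch would then be placed in the same data frame, so that the later comparison does not depend on which technique produced the tokens.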
Fig. 3 shows a diagram illustrating schematically an example process 600 for detecting a modification type according to this disclosure. Fig. 3 shows how to detect modification and generate modification data (e.g., one or more modifications) indicative of a modification in a first document (e.g., a document with a signature, such as document 502 of Fig. 2) with respect to a second document (e.g., an original version of a document for signature, such as document 508 of Fig. 2). The process 600 can be carried out by the electronic device disclosed herein.
For example, the first document (e.g., first document 14, 502 of Figs. 1A and 2) is based on the second document (e.g., second document 16, 508 of Figs. 1A and 2) with one or more modifications. Put differently, the first document may comprise modified tabular data and/or modified non-tabular data, which are modified with respect to the second document.
For example, a first data (e.g., first data 14A of Fig. 1B) is obtained from the first document by applying either a first extraction technique (e.g., first extraction technique 512 of Fig. 2) or a second extraction technique (e.g., second extraction technique 510 of Fig. 2) to the first document. For example, a second data (e.g., second data 16A of Fig. 1B) is obtained from the second document by applying either the first extraction technique or the second extraction technique to the second document.
A data frame 602 can include the first data and/or the second data.
For example, the first document and/or the first data comprises one or more tokens. For example, a token can be seen as one or more of: a word, a number, and a string. In one or more examples, the first document comprises one or more tokens. In one or more examples, the second document and/or the second data comprises one or more tokens.
For example, the electronic device disclosed herein can apply a textual sequence comparison logic 604 to detect the modification data between the first document and the second document. The textual sequence comparison logic can include one or more of: a fuzzy matching logic and a longest common subsequence technique. For example, the fuzzy matching logic can be used to detect if there is a modification between the first document and the second document (as illustrated by MATCH/NO MATCH in Fig. 1 B).
For example, the longest common subsequence technique can be used to determine the type of modification between the first document and the second document.
For example, the fuzzy matching logic can identify a token of the first data which partially (e.g., not fully) matches a token of the second data, such as via a similarity parameter. For example, the similarity parameter indicates the similarity between the respective token of the first data and the corresponding token of the second data, e.g., how similar a respective token of the first data is to the corresponding token of the second data. For example, a respective token of the first data (e.g., a word) comprises five characters (e.g., letters), of which only four are identical to the characters of the corresponding five-character token of the second data (e.g., comprised in the second document, such as the original document), which leads to a similarity parameter (e.g., a similarity level) of 80%. Put differently, the respective token of the first data is similar to the corresponding token of the second data by, for example, 80%.
For example, the respective token of the first data matches the corresponding token of the second data when the similarity parameter is above a threshold (e.g., no tampering of the respective token of the first data, such as illustrated by 606). For example, the respective token of the first data does not match the corresponding token of the second data when the similarity parameter is equal to or below the threshold.
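As an illustration of the similarity parameter and the threshold, the following Python sketch uses `difflib.SequenceMatcher` from the standard library. This is an assumption made for illustration only: the disclosure does not name a particular fuzzy matching implementation, and the function names `similarity_parameter` and `match_parameter` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity_parameter(first_token: str, second_token: str) -> float:
    # ratio() returns 2*M/T, where M is the number of matching characters
    # and T is the total number of characters in both tokens.
    return SequenceMatcher(None, first_token, second_token).ratio()


def match_parameter(first_token: str, second_token: str,
                    threshold: float = 0.95) -> str:
    # "Match" only when the similarity is strictly above the threshold;
    # a similarity equal to or below the threshold gives "No Match".
    if similarity_parameter(first_token, second_token) > threshold:
        return "Match"
    return "No Match"


# Five-character tokens with four identical characters: 2*4/10 = 0.8
print(similarity_parameter("hallo", "hello"))
print(match_parameter("hallo", "hello"))   # No Match (0.8 is not above 0.95)
print(match_parameter("hello", "hello"))   # Match
```

With this measure, the five-character example above yields a similarity parameter of 80%, which is below a threshold of 95% and therefore results in "No Match".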
For example, upon applying the fuzzy matching logic to each respective token of the first data and each corresponding token of the second data, a match parameter for each respective token of the first data and for each corresponding token of the second data is added to the data frame 602 (e.g., data frame 514 of Fig. 2). For example, the match parameter indicates for each respective token whether the respective token of the first data matches the corresponding token of the second data. In one or more examples, the match column comprises the match parameter (e.g., “Match” or “No Match”) for each token comprised in the first data and the second data (as illustrated in Fig. 1B). For example, the fuzzy matching logic initialises and/or populates the match column of the data frame 602 (e.g., data frame 514 of Fig. 2).
For example, the longest common subsequence technique determines the type of modification by comparing a respective sequence of tokens (e.g., one or more tokens, such as one or more words) of the first data with a corresponding sequence of tokens of the second data. For example, the longest common subsequence technique is applied to the data frame 602. For example, the modification type is indicative of a type of modification in a part of the first document (e.g., a modification type of a respective token).
For example, the longest common subsequence is applied (e.g., by the electronic device) to the data frame when the respective token of the first data does not match the corresponding token of the second data (e.g., an occurrence of a “No Match” in the match column) to determine the type of modification carried out. For example, the respective sequence of tokens of the first data handled by the longest common subsequence technique comprises one or more tokens of the first data (e.g., such respective token of the first data requires a correction). For example, the respective sequence of tokens of the second data comprises one or more tokens of the second data to allow determining the modification type.
For example, the longest common subsequence and a correction are iteratively applied (e.g., by the electronic device) to the data frame for each respective token of the first data associated with a match parameter indicating “No Match”. For example, the longest common subsequence and the correction are iteratively applied to the data frame until each respective token of the first data matches each corresponding token of the second data. In other words, the respective token of the first data is, for example, compared with the corresponding token of the second data for providing a first longest common subsequence. In some examples, the respective token of the first data does not match the corresponding token of the second data. For example, a correction is performed (e.g., by the electronic device) in the respective token of the first data (e.g., comprising a modification) for providing a respective corrected token of the first data.
After the correction is performed, the respective corrected token of the first data is compared with the corresponding token of the second data for providing a second longest common subsequence. For example, the first longest common subsequence is compared with the second longest common subsequence.
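For illustration, a longest common subsequence over two token sequences can be computed with the classic dynamic-programming approach sketched below in Python. The function name `longest_common_subsequence` and the example tokens are hypothetical, and the sketch assumes exact equality between tokens rather than a fuzzy comparison.

```python
def longest_common_subsequence(first_tokens, second_tokens):
    """Return one longest common subsequence of two token lists."""
    rows, cols = len(first_tokens), len(second_tokens)
    # table[i][j] holds the LCS length of first_tokens[:i] and second_tokens[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if first_tokens[i] == second_tokens[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back through the table to recover the tokens of the subsequence.
    result = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if first_tokens[i - 1] == second_tokens[j - 1]:
            result.append(first_tokens[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return result[::-1]


first = ["pay", "10", "units", "now"]    # tokens of the first data
second = ["pay", "50", "units", "now"]   # tokens of the second data
print(longest_common_subsequence(first, second))  # ['pay', 'units', 'now']
```

A subsequence computed before a correction and one computed after the correction can then be compared, for instance by their lengths relative to the length of the second sequence.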
The modification type can be determined based on the one or more corrections carried out in the respective token of the first data when the comparison meets a second criterion 604A, 604B, 604C.
The comparison between the first longest common subsequence and the second longest common subsequence provides a similarity parameter, which is compared to a threshold, such as 95%. For example, the comparison meets the second criterion when the similarity parameter between the first longest common subsequence and the second longest common subsequence is above the threshold.
For example, the respective corrected token of the first data comprises more tokens (e.g., words) in common with the corresponding token of the second data. In other words, a longer second longest common subsequence in respect to a first longest common subsequence can, for example, be indicative of a higher similarity parameter between the respective token of the first data and the corresponding token of the second data.
However, the similarity parameter needs to satisfy the threshold for the comparison to meet the second criterion.
For example, the respective token (e.g., a word) of the first data can be corrected by shifting a respective sequence of tokens (e.g., a string, such as comprising one or more words) of the first data upwards 610 or downwards 614 by a length until the respective sequence of tokens provides a match parameter indicative of a match between the respective sequence of tokens of the first data and a corresponding sequence of tokens of the second data. The upward shifting correction 610, for example, indicates that the modification type is a deletion type 608. The downward shifting correction 614, for example, indicates that the modification type is an addition type 612.
In some examples, the comparison does not meet the second criterion when the similarity parameter between the first longest common subsequence and the second longest common subsequence is not above the threshold, such as 95%. In other words, when the similarity parameter between the first longest common subsequence and the second longest common subsequence is not above the threshold, the modification type is determined by the electronic device as an alteration type 616.
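The distinction between the deletion type, the addition type and the alteration type can be illustrated with a simplified Python sketch. The sketch does not reproduce the shifting of rows in the data frame or the comparison of longest common subsequences against a threshold. Instead, it assumes that a shift corresponds to skipping tokens on one side and it uses exact token equality. The function name `classify_modification` and the parameters `max_shift` and `window` are hypothetical.

```python
def classify_modification(first_tokens, second_tokens, index,
                          max_shift=3, window=2):
    """Classify the modification at a 'No Match' position `index`."""

    def aligned(a, b):
        # Compare up to `window` tokens; at least one token per side is required.
        a, b = a[:window], b[:window]
        n = min(len(a), len(b))
        return n > 0 and a[:n] == b[:n]

    for shift in range(1, max_shift + 1):
        # Tokens of the second data are missing from the first data.
        if aligned(first_tokens[index:], second_tokens[index + shift:]):
            return "deletion"
        # Extra tokens are present in the first data.
        if aligned(first_tokens[index + shift:], second_tokens[index:]):
            return "addition"
    # No shift restores the alignment: the token itself was changed.
    return "alteration"


original = ["pay", "50", "units", "now"]
print(classify_modification(["pay", "units", "now"], original, 1))               # deletion
print(classify_modification(["pay", "not", "50", "units", "now"], original, 1))  # addition
print(classify_modification(["pay", "10", "units", "now"], original, 1))         # alteration
```

A token removed at the very end of the first data leaves nothing to realign and is reported as an alteration by this sketch, which is one of the limitations of the simplification.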
In some examples, the output 618 of the process 600 can include information indicating one or more modifications, and corresponding attributes (such as match parameter, similarity parameter and/or modification type). For example, the output 618 can provide an adjusted data frame including a highlight of the one or more modifications, which may be displayed or provided via a user interface object 620 (e.g., representation 704 of Fig. 4).
Fig. 4 shows a diagram 700 illustrating schematically an example representation 702 of a second document (such as an original document) and an example representation 704 of a first document (such as a signed document based on the original document where a modification took place) according to this disclosure. The first document 704 shows a signature 708 and a modification illustrated by 710 where a number which was 50 in the second document 702 is now changed to 10. This is a modification that can be presented to a user in a user interface using representation 704.
Figs. 5A-5B show a flow-chart of an exemplary method 100, performed by an electronic device, for detection of modification in a first document according to the disclosure. The electronic device is the electronic device disclosed herein, such as electronic device 300 of Fig. 6.
The method 100 comprises obtaining S102 first data indicative of tabular data and/or non-tabular data from the first document. For example, the method 100 comprises obtaining S102 first data indicative of at least one of tabular data and non-tabular data from the first document. For example, the method 100 comprises obtaining S102 first data indicative of tabular data and non-tabular data from the first document. In one or more example methods, the first data comprises tabular data and/or non-tabular data from the first document.
The method 100 comprises obtaining S104 second data indicative of tabular data and/or non-tabular data from a second document. In one or more example methods, the second data comprises tabular data and/or non-tabular data from the second document.
The method 100 comprises determining S106 modification data by applying S106A a textual sequence comparison logic to the first data and the second data. The method 100 comprises providing S108, based on the modification data, an output.
In one or more examples, the first document comprises tabular and/or non-tabular data, e.g., in Fig. 1A. In one or more examples, the second document comprises tabular and/or non-tabular data, e.g., in Fig. 1A. For example, tabular data can be data arranged in a table (e.g., arranged in one or more columns and in one or more rows). For example, non-tabular data can be data arranged in paragraphs. For example, non-tabular data can be data that is not in tabular form. Put differently, non-tabular data can be data which is not included in a table. In one or more examples, the second document can be seen as an original version of the first document. For example, the second document is an original document sent to a party for signature while the first document is the second document returned by the party after signature. There may be modifications that took place in the first document, and which can be detected in a more efficient, systematic, and robust manner, and without human intervention by using method 100.
In one or more example methods, the first document is based on the second document by including a signature. In one or more examples, the first document can be detected as based on the second document with one or more modifications. In one or more examples, the first document is a document for inspection, such as for detecting the one or more modifications with respect to the second document. In one or more examples, the first document comprises a signature. For example, including a signature in the first document can be seen as modifying the first document in relation to the second document. For example, the one or more modifications can be indicative of an inclusion of a signature in the first document. In one or more examples, including a signature in the first document can be seen as an acceptable modification of the first document in relation to the second document. In some examples, the signature indicates that a signing party approves and/or accepts one or more clauses comprised in the second document (e.g., original document). For example, the second document is the original document whose involved parties agreed on. In one or more example methods, the signature is one of a physical signature and an electronic signature. In one or more examples, the first document (e.g., document for inspection) is a scanned and signed version of the second document (e.g., original document). For example, the signature can be a physical signature. For example, a physical signature can be a wet signature and/or a pen-and-ink signature (e.g., a handwritten signature).
For example, the first document is the second document with one or more modifications. For example, the modification data can be indicative of the one or more modifications. Put differently, the first document may comprise modified tabular data and/or modified non-tabular data in relation to the second document. In some examples, the first document may be identical to the second document when no modification (e.g., alteration) is detected between the first document and the second document (e.g., by an electronic device). In such examples, the modification data can include a match parameter indicating “Match” for each token.
In one or more examples, the textual sequence comparison logic can detect the modification data (e.g., the one or more modifications) between the first document and the second document. Put differently, the textual sequence comparison logic can, for example, detect differences between two versions of a document, such as between a first document and a second document. The textual sequence comparison logic can include one or more of: a fuzzy matching logic and a longest common subsequence technique.
In one or more examples, the output can be provided to an entity (e.g., a company and/or an individual and/or an institution) so that a tampered document can be immediately detected (e.g., a legal document which is deliberately modified and/or a generic document intended to be inspected for detecting one or more modifications). In other words, the output comprises information indicating one or more modifications, and corresponding attributes (such as match parameter, similarity parameter and/or modification type). For example, the output can include data indicating which tabular data and/or non-tabular data of the first document (e.g., the document under test) differs from corresponding data of the second document (e.g., the original version, and/or the true document).
In one or more examples, the first document can be detected as a tampered document. The first document can be a generic document (e.g., of non-legal nature) which may have been modified with respect to an original document. In some examples, the first document can be a legal document which is deliberately modified with respect to an original document (e.g., the second document). Such deliberate modification can be seen as a forgery of the original document (e.g., the second document), which can render the original agreement void. For example, a legal document may be seen as a document that can have a legal impact, such as a document which expresses a duty, obligation or right. Examples of legal documents may include a contract document, a guideline document, an appendix to a contract document, an engagement letter document, a business terms document, a signature document, or any document which expresses a duty, obligation or right. In one or more examples, the first document is not a scanned version of the second document. For example, the second document can have a PDF format and/or a docx format and/or a doc format. For example, the first document comprises an electronic signature. For example, an electronic signature can be an image of a handwritten signature. The first document comprises, for example, the image of the handwritten signature. For example, an electronic signature can be a typewritten signature (e.g., using a keyboard) when an identification of a party signing the first document is performed using biometric data and/or an identification card inserted in a card reader and/or a password prior to signing the first document. In one or more example methods, the second document is an original document prior to including the signature.
In one or more examples, the first document is a signed version of the second document, which may include one or more other modifications. For example, the signature indicates that a signing party approves the tabular and/or non-tabular data comprised in the second document (e.g., original document).
In one or more example methods, obtaining S102 the first data comprises determining S102A whether the first document meets a first criterion. In one or more example methods, determining S102A whether the first document meets the first criterion comprises determining S102AA whether the first document includes a signature. In other words, when the first document includes a signature, the first document meets the first criterion and when the first document does not include a signature, the first document does not meet the first criterion. In one or more example methods, determining S102A whether the first document meets the first criterion comprises determining S102AB whether the first document is an image-based document. In one or more examples, the first document meets the first criterion when the first document is an image-based document. In one or more examples, an image-based document can be a scanned PDF document and/or an image captured by a camera (e.g., converted into a PDF format). In one or more example methods, the first document does not meet the first criterion when the first document is a text-based document.
In one or more example methods, obtaining S102 the first data comprises, upon the first document meeting the first criterion, applying S102B a first extraction technique to the first document to obtain the first data. In one or more example methods, the first extraction technique is based on optical extraction. In one or more examples, when the first document meets the first criterion, the first data is obtained (e.g., extracted) from the first document by applying the first extraction technique to the first document. For example, the first extraction technique is based on optical extraction. For example, the first extraction technique can be seen as an extraction tool for image-based documents (e.g., optical character recognition, OCR). In other words, the first extraction technique can convert scanned PDF documents and/or images captured by a digital camera into searchable data. Put differently, the first extraction technique enables, in some examples, extracting coordinates associated with the first data (e.g., the location of the first data in a table of the first document), as illustrated in 22 of Fig. 1 B. In one or more examples, upon determining that the first document includes the signature (e.g., a physical signature (such as a wet signature and/or a pen-and-ink signature) and/or an electronic signature), the first data is obtained from the first document by applying the first extraction technique to the first document. In some examples, the first document is an image-based document, and the first data is obtained from an optical extraction technique.
In one or more example methods, obtaining S102 the first data comprises, upon the first document not meeting the first criterion, applying S102C a second extraction technique to the first document to obtain the first data. In one or more example methods, the second extraction technique is based on text mining. In one or more examples, when the first document does not meet the first criterion, the electronic device obtains (e.g., extracts) the first data from the first document by applying the second extraction technique to the first document. For example, the second extraction technique is based on text mining. For example, the second extraction technique can be seen as an extraction tool for text-based documents (e.g., a text miner, such as a python library for mining text). For example, the second extraction technique enables extracting coordinates associated with the first data (e.g., the location of the first data in the first document), as illustrated in 22 of Fig. 1 B.
In one or more examples, when the first document does not meet the first criterion, the first data is obtained (e.g., extracted) from the first document by applying the second extraction technique to the first document. In one or more examples, the first document does not meet the first criterion when the first document is a text-based document. In one or more examples, the first document is a text-based document with a doc format and/or a docx format and/or a PDF format.
In one or more example methods, the first data comprises a token, optionally associated with coordinates. For example, a token can be seen as one or more of: a word, a number, and a string. The token can optionally be associated with a coordinate in the document, and/or with a position in a table when the token is part of a table. In one or more examples, the first document comprises one or more tokens. In one or more examples, the first extraction technique (e.g., OCR) and the second extraction technique (e.g., PDF miner) allow obtaining the coordinates of each token comprised in the first document.
In one or more example methods, obtaining S104 the second data comprises determining S104A whether the second document is an image-based document. In one or more example methods, obtaining S104 the second data comprises, upon the second document being an image-based document, applying S104B the first extraction technique to the second document to obtain the second data. In one or more examples, the second document can be seen as an original document in relation to the first document. For example, the second document does not comprise a signature. In one or more examples, when the second document is an image-based document, the second data is obtained (e.g., extracted) from the second document by applying the first extraction technique to the second document. In one or more examples, an image-based document can be a scanned PDF document and/or an image captured by a camera (e.g., converted into a PDF format). For example, the first extraction technique enables extracting coordinates associated with the second data (e.g., the location of the second data in the second document), as illustrated in 32 of Fig. 1B. In other words, the first extraction technique enables, for example, extracting coordinates of each token comprised in the second data.
In one or more example methods, obtaining S104 the second data comprises, upon the second document not being an image-based document, applying S104C the second extraction technique to the second document to obtain the second data. For example, the second document is a text-based document with a doc format and/or a docx format and/or a PDF format. For example, the second extraction technique enables extracting coordinates associated with the second data (e.g., the location of the second data in the second document), as illustrated in 32 of Fig. 1B. In other words, the second extraction technique enables, for example, extracting coordinates of each token comprised in the second document and/or data.
In some examples, the first extraction technique and the second extraction technique are applied to the first document and second document for obtaining the first data and the second data. In some examples, the coordinates associated with the first data extracted by either applying the first extraction technique or the second extraction technique to the first document can optionally be used for inspecting (e.g., determining) a similarity level between the first document (e.g., document for inspection) and the second document (e.g., original document). Put differently, each token comprised in the first data and associated with given coordinates can optionally be compared with a token associated with the corresponding coordinates in the second data. In some examples, the tokens are not identical for the same coordinates.
In one or more example methods, applying S106A the textual sequence comparison logic comprises applying S106AA a fuzzy matching logic. In one or more examples, the textual comparison logic can be seen as a technique for identifying the one or more modifications between the first document (e.g., comprising a signature) and the second document (e.g., original document). In one or more examples, the textual sequence comparison logic comprises a fuzzy matching logic.
In one or more example methods, applying S106AA the fuzzy matching logic comprises determining S106AAA, based on a fuzzy matching between a respective token of the first data and a corresponding token of the second data, a match parameter per token. In one or more example methods, the match parameter per token indicates for each respective token whether the respective token of the first data matches the corresponding token of the second data. In one or more example methods, applying S106AA the fuzzy matching logic comprises including S106AAB the match parameter in the modification data. For example, the fuzzy matching logic quantifies (e.g., determines) how similar a respective token of the first data is to a corresponding token of the second data. In one or more examples, the fuzzy matching algorithm can identify a token of the first data which partially (e.g., not fully) matches with a token of the second data, e.g., via the similarity parameter disclosed herein. In other words, the fuzzy matching logic can be seen as an approximate matching algorithm. In one or more examples, the fuzzy matching algorithm allows comparing the first document with the second document at word level. In one or more examples, the match parameter per token can be seen as a binary parameter (e.g., “match” or “no match”). Put differently, the match parameter indicates whether the respective token of the first data matches the corresponding token of the second data. In one or more examples, the modification data comprises the match parameter.
In one or more example methods, determining S106AAA, based on the fuzzy matching between the respective token of the first data and the corresponding token of the second data, the match parameter per token comprises determining S106AAAA a similarity parameter between the respective token of the first data and the corresponding token of the second data. In one or more examples, identifying whether the respective token of the first data matches the corresponding token of the second data includes inspecting a level of similarity between the respective token of the first data and the corresponding token of the second data. For example, the similarity parameter is indicative of the level of similarity between the respective token of the first data and the corresponding token of the second data. For example, a respective token of the first data (e.g., a word) comprises five characters (e.g., letters), in which only four characters of the first data are identical to the five characters of the second data (e.g., comprised in the second document, such as the original document). In some examples, a similarity parameter (e.g., a similarity level) is of 80%. Put differently, a respective token of the first data is similar to a corresponding token of the second data by, for example, 80%.
In one or more example methods, determining S106AAA, based on the fuzzy matching between the respective token of the first data and the corresponding token of the second data, the match parameter per token comprises determining S106AAAB the match parameter as indicative of a match between the respective token of the first data and the corresponding token of the second data when the similarity parameter is above a threshold. In one or more examples, the respective token of the first data matches the corresponding token of the second data when the similarity parameter is above the threshold. In one or more examples, the threshold can be seen as a similarity level above which the respective token of the first data is considered matching the corresponding token of the second data (e.g., the match parameter is a “Match” as illustrated in 38 of Fig. 1B). In one or more examples, the threshold can be 90%, 95%, or any other suitable threshold. For example, the threshold is 95%. For example, a respective token of the first data is similar to a corresponding token of the second data by 98% (e.g., the similarity parameter is 98%). For example, the respective token of the first data matches the corresponding token of the second data as the similarity parameter (e.g., 98%) is above the threshold (e.g., 95%). For example, the threshold imposes a minimum level of similarity to have the respective token of the first data matching the corresponding token of the second data. For example, the respective token of the first data matches, based on the threshold and the similarity parameter, the corresponding token of the second data.
In one or more example methods, determining S106AAA, based on the fuzzy matching between the respective token of the first data and the corresponding token of the second data, the match parameter per token comprises determining S106AAAC the match parameter as indicative of no match between the respective token of the first data and the corresponding token of the second data when the similarity parameter is equal to or below the threshold. In one or more examples, the respective token of the first data does not match the corresponding token of the second data when the similarity parameter is equal to or below the threshold. For example, the threshold is 95%. For example, a respective token of the first data is similar to a corresponding token of the second data by 80% (e.g., the similarity parameter is 80%). For example, the respective token of the first data does not match the corresponding token of the second data as the similarity parameter (e.g., 80%) is below the threshold (e.g., 95%). For example, the match parameter is a “No Match” as illustrated in 38 of Fig. 1B.
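The threshold rule of S106AAAB and S106AAAC may, for example, be sketched as below. The function name `match_parameter` and the default threshold of 95% are illustrative assumptions; the only behaviour taken from the description is that a similarity strictly above the threshold gives a “Match”, while a similarity equal to or below it gives a “No Match”.

```python
def match_parameter(similarity: float, threshold: float = 95.0) -> str:
    """Map a similarity level (in percent) to a match parameter.

    A similarity strictly above the threshold is a "Match"; a similarity
    equal to or below the threshold is a "No Match".
    """
    return "Match" if similarity > threshold else "No Match"


for level in (98.0, 95.0, 80.0):
    print(level, match_parameter(level))
```

Note that a similarity parameter exactly equal to the threshold is treated as no match, which follows from S106AAAC.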
In one or more example methods, the method 100 comprises generating S105 a data frame by concatenating S105A at least two of: the first data, the second data, and the match parameter. In one or more examples, the data frame (e.g., data frame 2 of Fig. 1B) can be seen as a data structure comprising the first data, which is extracted from the first document (e.g., as illustrated in 14 of Fig. 1A), and the second data, which is extracted from the second document (e.g., as illustrated in 16 of Fig. 1A). For example, the second document is the original version of a document. For example, the first document comprises a signature and is based on the second document. For example, the second document does not comprise a signature. In one or more examples, the data frame is a concatenation of the first data, the second data and the match parameter. For example, the first data comprises a first location data indicative of a location of a token in the first document. For example, the first location data (e.g., as illustrated in 14A of Fig. 1B) is obtained by applying either the first or the second extraction technique to the first document. For example, the second data comprises a second location data indicative of a location of a token in the second document. For example, the second location data (e.g., as illustrated in 16A of Fig. 1B) is obtained by applying either the first or the second extraction technique to the second document. For example, for each token comprised in the first document and the second document, the first and the second location data comprise one or more of: a page of the first document or the second document, a coordinate of a token, an index of a table (e.g., a table that comprises the tabular data), and a row of the table. The data frame is, for example, in the form of a table or matrix for comparing the first document with the second document (e.g., for comparing the first data with the second data).
In some examples, the first data and the second data are extracted in a sequential manner to maintain consistency in the sequence, allowing detection of the boundary of each cell of a table. In one or more examples, applying the fuzzy matching logic to the first document and the second document can be seen as resulting in the generation of the data frame (e.g., as initialising and/or populating a match column). In one or more examples, the data frame comprises the match column. In one or more examples, the match column comprises the match parameter (e.g., “Match” or “No Match”) for each token comprised in the first data and the second data (as illustrated in Fig. 1B).
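As a minimal sketch of S105 and S105A, the data frame may be represented as a list of rows, each row concatenating a token of the first data, the corresponding token of the second data, and the match parameter. The row layout, the key names (`first`, `second`, `match`), the row-by-row pairing of tokens, and the use of `difflib.SequenceMatcher` as the fuzzy matching are all assumptions of this sketch; location data (page, coordinate, table index, row) is omitted for brevity.

```python
from difflib import SequenceMatcher
from itertools import zip_longest


def build_data_frame(first_tokens, second_tokens, threshold=95.0):
    """Pair tokens row by row and populate a match column.

    An empty cell (None) on either side is recorded as "No Match".
    """
    rows = []
    for first, second in zip_longest(first_tokens, second_tokens):
        if first is None or second is None:
            similarity = 0.0
        else:
            similarity = 100.0 * SequenceMatcher(None, first, second).ratio()
        rows.append({
            "first": first,
            "second": second,
            "match": "Match" if similarity > threshold else "No Match",
        })
    return rows


frame = build_data_frame(["Port", "Aarhus", "40"], ["Port", "Aarhus", "20"])
for row in frame:
    print(row)
```

In this sketch the third row is flagged as “No Match”, which is the kind of row to which the sequence comparison is subsequently applied.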
In one or more example methods, applying S106A the textual sequence comparison logic comprises applying S106AB a longest common subsequence technique to the data frame for providing a modification type. In one or more examples, the textual sequence comparison logic comprises a longest common subsequence, LCS, technique. For example, the LCS determines the longest subsequence in two sequences of tokens (e.g., sequences of words) in a same order.
In one or more examples, after applying the fuzzy matching logic to the first document and the second document to generate a first version of the data frame, the LCS technique is applied to the data frame obtained in S105.
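For illustration, a longest common subsequence between two sequences of tokens may be computed with the standard dynamic programming approach sketched below. The function name `longest_common_subsequence` and the exact token comparison (equality rather than fuzzy matching) are assumptions of this sketch and are not taken from the disclosure.

```python
def longest_common_subsequence(first, second):
    """Return the longest subsequence of tokens common to both sequences,
    in a same order, using the classic dynamic programming table."""
    rows, cols = len(first), len(second)
    # table[i][j] holds the LCS length of first[:i] and second[:j].
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    for i in range(rows):
        for j in range(cols):
            if first[i] == second[j]:
                table[i + 1][j + 1] = table[i][j] + 1
            else:
                table[i + 1][j + 1] = max(table[i][j + 1], table[i + 1][j])

    # Walk back through the table to recover the tokens themselves.
    result = []
    i, j = rows, cols
    while i > 0 and j > 0:
        if first[i - 1] == second[j - 1]:
            result.append(first[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    result.reverse()
    return result


first = ["ship", "20", "containers", "to", "Aarhus"]
second = ["ship", "containers", "to", "Aarhus"]
print(longest_common_subsequence(first, second))
```

The length of the returned list corresponds to the number of tokens common to the first data and the second data in a same order.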
In one or more example methods, applying S106AB the longest common subsequence technique to the data frame for providing the modification type comprises determining S106ABA, based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type. In one or more example methods, the modification type is indicative of a type of modification between the respective token of the first data and the corresponding token of the second data. In one or more example methods, applying S106AB the longest common subsequence technique to the data frame for providing the modification type comprises including S106ABB the modification type in the modification data. In other words, the modification data can include the modification type of a respective token. In one or more examples, the LCS identifies the one or more modifications between the first document (e.g., comprising a signature) and the second document (e.g., original document) by comparing a respective sequence of tokens (e.g., one or more tokens, such as one or more words) of the first document with a corresponding sequence of tokens of the second document. In one or more examples, the modification data is indicative of one or more modifications. For example, the modification data comprises the modification type.
In one or more examples, comparing the respective sequence of tokens of the first document with the corresponding sequence of tokens of the second document allows determining a modification type. For example, the modification type is indicative of a type of modification in a part of the first document. For example, comparing the first document (e.g., comprising a signature) with the second document (e.g., original document) at a sequence of tokens level (e.g., instead of at token level) can reduce time for detecting the type of modification in a part of the first document.
In one or more examples, comparing the respective sequence of tokens of the first document with the corresponding sequence of tokens of the second document comprises identifying, based on the match parameter (e.g., obtained by applying the fuzzy matching logic and illustrated in 38 of Fig. 1B), the LCS of the first and the second document by progressing from top to bottom of the data frame for each page (e.g., illustrated as 18, 28 in Fig. 1B) and each table (e.g., illustrated as 24, 34 in Fig. 1B) of the first document and the second document. In some examples, an index identifies a table. In other words, the LCS is applied (e.g., by the electronic device) to the data frame when the respective token of the first data does not match the corresponding token of the second data (e.g., an occurrence of a “No Match”) so as to identify the type of modification carried out. For example, an occurrence of a “No Match” indicates a modification in the first document in relation to the second document. For example, an occurrence of a “No Match” can be seen as a mismatch between the first document in relation to the second document.
In one or more example methods, determining S106ABA, based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises iteratively determining S106ABAA a first longest common subsequence between the respective token of the first data and the corresponding token of the second data. In one or more example methods, determining S106ABA, based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises iteratively performing S106ABAB a correction in the respective token of the first data. In one or more example methods, the correction is one or more of: a shift, an alteration, an addition, and a deletion. In one or more example methods, determining S106ABA, based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises iteratively determining S106ABAC a second longest common subsequence between the respective corrected token of the first data and the corresponding token of the second data, e.g., after performing the correction. In one or more example methods, determining S106ABA, based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises iteratively comparing S106ABAD the first longest common subsequence with the second longest common subsequence. In one or more example methods, determining S106ABA, based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises iteratively determining S106ABAE whether the comparison meets a second criterion. 
For example, the comparison meets the second criterion when the match parameter indicates a “Match” (e.g., in the match column comprised in the data frame) for each respective token of the first data and each corresponding token of the second data.
In one or more examples, the LCS and a correction are iteratively applied (e.g., by the electronic device) to the data frame for each respective token of the first data associated with a match parameter indicating “No Match” until a match is found for each corresponding token of the second data. Put differently, the LCS is, for example, iteratively applied (e.g., by the electronic device) to the data frame until the match parameter indicates a “Match” (e.g., in the match column comprised in the data frame) for each respective token of the first data and each corresponding token of the second data. For example, a respective token of the first data is compared with a corresponding token of the second data by applying the LCS for provision of a first longest common subsequence. For example, the first LCS has a length. For example, the first LCS is a number of tokens (e.g., words) comprised in a part of the first data. For example, the respective token of the first data is comprised in such part of the first data. In other words, the first LCS is the number of words common to the first data and the second data in a same order. In one or more examples, when a respective token of the first data does not match a corresponding token of the second data, a correction is performed in the respective token of the first data.
For example, as illustrated in Fig. 3, the respective token (e.g., a word) of the first data can be corrected by shifting upwards, or downwards a respective sequence of tokens (e.g., a string, such as comprising one or more words) of the first data by a length until the respective sequence of tokens starts matching a corresponding sequence of tokens of the second data. For example, the respective token of the first data can be corrected by shifting the respective token of the first data in the data frame. In other words, the respective token of the first data can, for example, be corrected by changing position of the respective token of the first data in the data frame (e.g., which results in a change in the position of the respective token of the first data in the first document). This may be indicative of a modification that is a deletion or an addition.
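The shifting correction may, purely as a sketch, be expressed as a search for the shift length at which the first data starts matching the second data again. The function name `find_shift`, the `max_shift` bound, the exact token comparison, and the interpretation of the shift directions (a downward shift leaving empty cells in the first data, an upward shift skipping extra tokens of the first data) are assumptions of this sketch.

```python
def find_shift(first, second, start, max_shift=5):
    """Find the smallest shift that realigns first[start] with the second data.

    Returns ("down", n) when moving the first-data tokens n rows downwards
    makes first[start] line up with second[start + n], ("up", n) when moving
    them n rows upwards makes first[start + n] line up with second[start],
    or None when no shift up to max_shift realigns the tokens.
    """
    if start >= len(first) or start >= len(second):
        return None
    for length in range(1, max_shift + 1):
        # Downward shift: rows start..start+length-1 of the first data
        # become empty cells (tokens missing from the first data).
        if start + length < len(second) and first[start] == second[start + length]:
            return ("down", length)
        # Upward shift: tokens first[start..start+length-1] have no
        # counterpart in the second data (extra tokens in the first data).
        if start + length < len(first) and first[start + length] == second[start]:
            return ("up", length)
    return None


second = ["pay", "within", "30", "days"]
print(find_shift(["pay", "30", "days"], second, start=1))
print(find_shift(["pay", "now", "within", "30", "days"], second, start=1))
```

In this sketch only the first token after the shift is tested; an implementation following the description more closely would re-run the LCS on the shifted sequence and compare it with the LCS before the shift.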
In one or more example methods, determining S106ABA, based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises, upon the comparison meeting the second criterion, determining S106ABAF the modification type based on one or more corrections. The corrections can include downward shifting, upward shifting, an alteration, an addition, and a deletion.
In one or more examples, the modification type is a deletion type. For example, the correction is an addition. For example, the data frame comprises an empty cell associated with the first data (as illustrated in 14 of Fig. 1B). For example, a respective token of the first data has been deleted from the first document in relation to the corresponding token of the second data. In one or more examples, the first data is upward shifted by a length for provision of a first respective adjusted token of the first data. For example, the respective token of the first data can be corrected by adding the respective deleted token of the first data. For example, the respective deleted token is corrected by adding it to the first data based on the corresponding token of the second data so as to identify that the modification is a deletion.
In one or more examples, the modification type is an addition type, and the correction is a deletion. For example, the data frame comprises an empty cell associated with the second data. For example, a respective token of the first data has been added to the first document in relation to the corresponding token of the second data. For example, the respective added token is corrected by deleting it from the first data until the first data matches the second data so as to identify that the modification is an addition.
In one or more examples, the correction is an alteration. For example, a respective token of the first data has been modified in relation to the corresponding token of the second data. For example, modifying a respective token of the first data can comprise modifying the corresponding token (e.g., from the second document) to a different token or modifying the corresponding token to a similar token (e.g., a token modified beyond a threshold). In one or more examples, the threshold for calculating the similarity used for comparing the sequence can, for example, be the same as or different from the threshold used for the fuzzy matching algorithm. For example, the first respective adjusted token of the first data is compared with the corresponding token of the second data by applying the LCS for provision of a second longest common subsequence. For example, the second LCS has a length. For example, the second LCS is a number of tokens (e.g., words) comprised in a part of the first data. For example, the first respective adjusted token of the first data is comprised in such part of the first data. In other words, the second LCS is the number of words common to the first adjusted first data and the second data in a same order. In one or more examples, the first LCS is compared with the second LCS for identifying the modification type.
In one or more example methods, the modification type comprises one or more of: a deletion type, an addition type, and an alteration type.
In one or more examples, the modification type includes a deletion type. The deletion type can be seen as, for example, deleting a part (e.g., a token and/or a sequence of tokens) of the first document (e.g., comprising a signature) in relation to the second document (e.g., the original document). In one or more examples, the modification type includes an addition type. The addition type can be seen as, for example, adding new data (e.g., a token and/or a sequence of tokens) to the first document in relation to the second document. In one or more examples, the modification type includes an alteration type. The alteration type can be seen as, for example, modifying a part (e.g., a token and/or a sequence of tokens) of the first document in relation to the second document.
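The three modification types may be illustrated with the sketch below. It does not implement the iterative LCS and correction procedure of S106ABAA to S106ABAF; instead it substitutes the sequence alignment of `difflib.SequenceMatcher` from the Python standard library and maps its opcodes onto the deletion, addition and alteration types. The function name `modification_types` and the output layout are assumptions of this sketch.

```python
from difflib import SequenceMatcher

# Opcodes are computed from the second (original) data to the first data.
TAG_TO_TYPE = {
    "delete": "deletion",     # tokens of the second data missing in the first
    "insert": "addition",     # tokens of the first data absent from the second
    "replace": "alteration",  # tokens changed between the two documents
}


def modification_types(first_tokens, second_tokens):
    """List the modifications of the first data relative to the second data."""
    matcher = SequenceMatcher(None, second_tokens, first_tokens, autojunk=False)
    modifications = []
    for tag, s_start, s_end, f_start, f_end in matcher.get_opcodes():
        if tag == "equal":
            continue
        modifications.append({
            "type": TAG_TO_TYPE[tag],
            "second": second_tokens[s_start:s_end],
            "first": first_tokens[f_start:f_end],
        })
    return modifications


second = ["pay", "100", "USD", "on", "delivery"]
first = ["pay", "900", "USD", "delivery", "now"]
for modification in modification_types(first, second):
    print(modification)
```

Here "100" changed to "900" is reported as an alteration, the missing "on" as a deletion, and the extra "now" as an addition.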
In one or more example methods, providing S108 the output comprises displaying S108A, on a display device (e.g., a display device of the electronic device), a user interface object representative of the output. In one or more examples, displaying the user interface object includes highlighting part of the first document that was modified, as illustrated in Fig. 4. In one or more examples, the output is the data frame adjusted based on the modification data. Put differently, the output includes, for example, the data frame resulting from the correction of a mismatch between a respective token of the first data and a corresponding token of the second data. A mismatch can be seen, for example, as an incorrect shift of a respective token of the first data and a corresponding token of the second data. For example, the respective token of the first data and the corresponding token of the second data are not aligned in the data frame (e.g., the respective token of the first data and the corresponding token of the second data are not placed in a same row).
It may be envisaged that the first document and/or the second document can be pre-processed to reduce noise, e.g., by applying filters to the data frame, e.g., in the fuzzy matching logic. For example, noisy cases caused by differences in OCR between the two documents can be handled by using filters, such as for cases having only one character difference (such as differences occurring between “$” and “S”, or “I” and “I”) and cases where a word is broken into two or more words in one of the documents.
Fig. 6 shows a block diagram of an example electronic device 300 according to the disclosure. The electronic device 300 comprises a memory 301, a processor 302, and an interface 303. The electronic device 300 may be configured to perform any of the methods disclosed in Figs. 5A-5B. In other words, the electronic device 300 may be configured for detection of modification in a first document.
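The two kinds of noise filter mentioned above may be sketched as follows. The set of confusable character pairs, the function names `is_ocr_noise` and `is_split_word`, and the rule that a split word is detected by joining consecutive tokens are hypothetical and serve only to illustrate the idea.

```python
# Hypothetical set of character pairs that OCR is assumed to confuse.
CONFUSABLE_PAIRS = {frozenset(("$", "S"))}


def is_ocr_noise(first_token: str, second_token: str) -> bool:
    """True when the tokens differ by exactly one confusable character."""
    if len(first_token) != len(second_token):
        return False
    differences = [
        (a, b) for a, b in zip(first_token, second_token) if a != b
    ]
    return len(differences) == 1 and frozenset(differences[0]) in CONFUSABLE_PAIRS


def is_split_word(split_tokens, whole_token: str) -> bool:
    """True when a word of one document appears broken into two or more
    consecutive tokens in the other document."""
    return len(split_tokens) > 1 and "".join(split_tokens) == whole_token


print(is_ocr_noise("U$D", "USD"))
print(is_split_word(["con", "tainer"], "container"))
```

Token pairs flagged by such filters could, for example, be treated as a “Match” before the sequence comparison is applied, so that OCR artefacts are not reported as modifications.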
The electronic device 300 is configured to obtain (e.g., via the interface 303 and/or using the memory 301) first data indicative of tabular data and/or non-tabular data from the first document.
The electronic device 300 is configured to obtain (e.g., via the interface 303 and/or using the memory 301) second data indicative of tabular data and/or non-tabular data from a second document.
The electronic device 300 is configured to determine (e.g., using the processor 302) modification data by applying a textual sequence comparison logic to the first data and the second data. The electronic device 300 is configured to provide (e.g., using the processor 302 and/or the interface 303), based on the modification data, an output.
The electronic device 300 is optionally configured to perform any of the operations disclosed in Figs. 5A-5B (such as any one or more of: S102A, S102AA, S102AB, S102B, S102C, S104A, S104B, S104C, S105, S105A, S106AA, S106AAA, S106AAAA, S106AAAB, S106AAAC, S106AAB, S106AB, S106ABA, S106ABAB, S106ABAC, S106ABAD, S106ABAE, S106ABAF, S106ABB, S108A). The operations of the electronic device 300 may be embodied in the form of executable logic routines (e.g., lines of code, software programs, etc.) that are stored on a non-transitory computer readable medium (e.g., the memory 301) and are executed by the processor 302.
Furthermore, the operations of the electronic device 300 may be considered a method that the electronic device 300 is configured to carry out. Also, while the described functions and operations may be implemented in software, such functionality may as well be carried out via dedicated hardware or firmware, or some combination of hardware, firmware and/or software.
The memory 301 may be one or more of a buffer, a flash memory, a hard drive, a removable media, a volatile memory, a non-volatile memory, a random access memory (RAM), or other suitable device. In a typical arrangement, the memory 301 may include a non-volatile memory for long term data storage and a volatile memory that functions as system memory for the processor 302. The memory 301 may exchange data with the processor 302 over a data bus. Control lines and an address bus between the memory 301 and the processor 302 also may be present (not shown in Fig. 6). The memory 301 is considered a non-transitory computer readable medium.
The memory 301 may be configured to store the first data, the second data, the modification data (e.g., including a match parameter, a modification type), a data frame, and/or the output in a part of the memory.
Embodiments of methods and products (electronic device) according to the disclosure are set out in the following items:
Item 1. A method, performed by an electronic device, for detection of modification in a first document, the method comprising: obtaining (S102) first data indicative of tabular data and/or non-tabular data from the first document; obtaining (S104) second data indicative of tabular data and/or non-tabular data from a second document; determining (S106) modification data by applying (S106A) a textual sequence comparison logic to the first data and the second data; and providing (S108), based on the modification data, an output.
Item 2. The method according to item 1, wherein the first document is based on the second document by including a signature.
Item 3. The method according to item 2, wherein the signature is one of a physical signature and an electronic signature.
Item 4. The method according to any of items 2-3, wherein the second document is an original document prior to including the signature.
Item 5. The method according to any of the previous items, wherein obtaining (S102) the first data comprises: determining (S102A) whether the first document meets a first criterion; upon the first document meeting the first criterion, applying (S102B) a first extraction technique to the first document to obtain the first data, wherein the first extraction technique is based on optical extraction; and upon the first document not meeting the first criterion, applying (S102C) a second extraction technique to the first document to obtain the first data, wherein the second extraction technique is based on text mining.
Item 6. The method according to any of the previous items, wherein determining (S102A) whether the first document meets the first criterion comprises determining (S102AA) whether the first document includes a physical signature.
Item 7. The method according to any of the previous items, wherein determining (S102A) whether the first document meets the first criterion comprises determining (S102AB) whether the first document is an image-based document.
Item 8. The method according to any of the previous items, wherein the first document does not meet the first criterion when the first document is a text-based document.
Item 9. The method according to any of the previous items, wherein the first data comprises a token.
Item 10. The method according to any of the previous items, wherein obtaining (S104) the second data comprises: determining (S104A), whether the second document is an image-based document; upon the second document being an image-based document, applying (S104B) the first extraction technique to the second document to obtain the second data; and upon the second document not being an image-based document, applying (S104C) the second extraction technique to the second document to obtain the second data.
Item 11. The method according to any of the previous items, wherein applying (S106A) the textual sequence comparison logic comprises applying (S106AA) a fuzzy matching logic.
Item 12. The method according to item 11, wherein applying (S106AA) the fuzzy matching logic comprises: determining (S106AAA), based on a fuzzy matching between a respective token of the first data and a corresponding token of the second data, a match parameter per token indicating for each respective token whether the respective token of the first data matches the corresponding token of the second data; and including (S106AAB) the match parameter in the modification data.
Item 13. The method according to item 12, wherein determining (S106AAA), based on the fuzzy matching between the respective token of the first data and the corresponding token of the second data, the match parameter per token comprises: determining (S106AAAA) a similarity parameter between the respective token of the first data and the corresponding token of the second data; and determining (S106AAAB) the match parameter as indicative of a match between the respective token of the first data and the corresponding token of the second data when the similarity parameter is above a threshold.
Item 14. The method according to any of items 12-13, wherein determining (S106AAA), based on the fuzzy matching between the respective token of the first data and the corresponding token of the second data, the match parameter per token comprises: determining (S106AAAC) the match parameter as indicative of no match between the respective token of the first data and the corresponding token of the second data when the similarity parameter is equal or below the threshold.
Item 15. The method according to any of items 12-14, the method comprising generating (S105) a data frame by concatenating (S105A) at least two of: the first data, the second data, and the match parameter.
Item 16. The method according to item 15, wherein applying (S106A) the textual sequence comparison logic comprises applying (S106AB) a longest common subsequence technique to the data frame for providing a modification type.
Item 17. The method according to item 16, wherein applying (S106AB) the longest common subsequence technique to the data frame for providing the modification type comprises: determining (S106ABA), based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type indicative of a type of modification between the respective token of the first data and the corresponding token of the second data; and including (S106ABB) the modification type in the modification data.
Item 18. The method according to item 17, wherein determining (S106ABA), based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises iteratively: determining (S106ABAA) a first longest common subsequence between the respective token of the first data and the corresponding token of the second data; performing (S106ABAB) a correction in the respective token of the first data, wherein the correction is one or more of: a shift, an alteration, an addition, and a deletion; determining (S106ABAC) a second longest common subsequence between the respective token of the first data and the corresponding token of the second data; comparing (S106ABAD) the first longest common subsequence with the second longest common subsequence; and determining (S106ABAE) whether the comparison meets a second criterion.
Item 19. The method according to item 18, wherein determining (S106ABA), based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises: upon the comparison meeting the second criterion, determining (S106ABAF) the modification type based on one or more corrections.
Item 20. The method according to any of items 16-19, wherein the modification type comprises one or more of: a deletion type, an addition type, and an alteration type.
Item 21. The method according to any of the previous items, wherein providing (S108) the output comprises displaying (S108A), on a display device, a user interface object representative of the output.
Item 22. An electronic device comprising a memory, a processor, and an interface, wherein the electronic device is configured to perform any of the methods according to items 1-21.
Item 23. A computer readable storage medium storing one or more programs, the one or more programs comprising instructions, which when executed by an electronic device cause the electronic device to perform any of the methods of items 1-21.
The use of the terms “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. does not imply any particular order; rather, the terms are included to identify individual elements. Moreover, the use of the terms “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. does not denote any order or importance, but rather the terms “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. are used to distinguish one element from another. Note that the words “first”, “second”, “third” and “fourth”, “primary”, “secondary”, “tertiary” etc. are used here and elsewhere for labelling purposes only and are not intended to denote any specific spatial or temporal ordering. Furthermore, the labelling of a first element does not imply the presence of a second element and vice versa.
It may be appreciated that Figs. 1A-6 comprise some circuitries or operations which are illustrated with a solid line and some circuitries or operations which are illustrated with a dashed line. The circuitries or operations which are comprised in a solid line are circuitries or operations which are comprised in the broadest example embodiment. The circuitries or operations which are comprised in a dashed line are example embodiments which may be comprised in, or a part of, or are further circuitries or operations which may be taken in addition to the circuitries or operations of the solid line example embodiments. It should be appreciated that these operations need not be performed in the order presented. Furthermore, it should be appreciated that not all of the operations need to be performed. The exemplary operations may be performed in any order and in any combination.
It is to be noted that the word "comprising" does not necessarily exclude the presence of other elements or steps than those listed.
It is to be noted that the words "a" or "an" preceding an element do not exclude the presence of a plurality of such elements.
It should further be noted that any reference signs do not limit the scope of the claims, that the exemplary embodiments may be implemented at least in part by means of both hardware and software, and that several "means", "units" or "devices" may be represented by the same item of hardware.
The various exemplary methods, devices, nodes, and systems described herein are described in the general context of method steps or processes, which may be implemented in one aspect by a computer program product, embodied in a computer-readable medium, including computer-executable instructions, such as program code, executed by computers in networked environments. A computer-readable medium may include removable and non-removable storage devices including, but not limited to, Read Only Memory (ROM), Random Access Memory (RAM), compact discs (CDs), digital versatile discs (DVD), etc. Generally, program circuitries may include routines, programs, objects, components, data structures, etc. that perform specified tasks or implement specific abstract data types. Computer-executable instructions, associated data structures, and program circuitries represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps or processes.
Although features have been shown and described, it will be understood that they are not intended to limit the claimed disclosure, and it will be made obvious to those skilled in the art that various changes and modifications may be made without departing from the scope of the claimed disclosure. The specification and drawings are, accordingly, to be regarded in an illustrative rather than restrictive sense. The claimed disclosure is intended to cover all alternatives, modifications, and equivalents.

Claims

1. A method, performed by an electronic device, for detection of modification in a first document, the method comprising: obtaining (S102) first data indicative of tabular data and/or non-tabular data from the first document; obtaining (S104) second data indicative of tabular data and/or non-tabular data from a second document; determining (S106) modification data by applying (S106A) a textual sequence comparison logic to the first data and the second data; and providing (S108), based on the modification data, an output.
2. The method according to claim 1, wherein the first document is based on the second document by including a signature.
3. The method according to claim 2, wherein the signature is one of a physical signature and an electronic signature.
4. The method according to any of claims 2-3, wherein the second document is an original document prior to including the signature.
5. The method according to any of the previous claims, wherein obtaining (S102) the first data comprises: determining (S102A) whether the first document meets a first criterion; upon the first document meeting the first criterion, applying (S102B) a first extraction technique to the first document to obtain the first data, wherein the first extraction technique is based on optical extraction; and upon the first document not meeting the first criterion, applying (S102C) a second extraction technique to the first document to obtain the first data, wherein the second extraction technique is based on text mining.
6. The method according to any of the previous claims, wherein determining (S102A) whether the first document meets the first criterion comprises determining (S102AA) whether the first document includes a physical signature.
7. The method according to any of the previous claims, wherein determining (S102A) whether the first document meets the first criterion comprises determining (S102AB) whether the first document is an image-based document.
8. The method according to any of the previous claims, wherein the first document does not meet the first criterion when the first document is a text-based document.
9. The method according to any of the previous claims, wherein the first data comprises a token.
10. The method according to any of the previous claims, wherein obtaining (S104) the second data comprises: determining (S104A), whether the second document is an image-based document; upon the second document being an image-based document, applying (S104B) the first extraction technique to the second document to obtain the second data; and upon the second document not being an image-based document, applying (S104C) the second extraction technique to the second document to obtain the second data.
11. The method according to any of the previous claims, wherein applying (S106A) the textual sequence comparison logic comprises applying (S106AA) a fuzzy matching logic.
12. The method according to claim 11, wherein applying (S106AA) the fuzzy matching logic comprises: determining (S106AAA), based on a fuzzy matching between a respective token of the first data and a corresponding token of the second data, a match parameter per token indicating for each respective token whether the respective token of the first data matches the corresponding token of the second data; and including (S106AAB) the match parameter in the modification data.
13. The method according to claim 12, wherein determining (S106AAA), based on the fuzzy matching between the respective token of the first data and the corresponding token of the second data, the match parameter per token comprises: determining (S106AAAA) a similarity parameter between the respective token of the first data and the corresponding token of the second data; and determining (S106AAAB) the match parameter as indicative of a match between the respective token of the first data and the corresponding token of the second data when the similarity parameter is above a threshold.
14. The method according to any of claims 12-13, wherein determining (S106AAA), based on the fuzzy matching between the respective token of the first data and the corresponding token of the second data, the match parameter per token comprises: determining (S106AAAC) the match parameter as indicative of no match between the respective token of the first data and the corresponding token of the second data when the similarity parameter is equal to or below the threshold.
15. The method according to any of claims 12-14, the method comprising generating (S105) a data frame by concatenating (S105A) at least two of: the first data, the second data, and the match parameter.
16. The method according to claim 15, wherein applying (S106A) the textual sequence comparison logic comprises applying (S106AB) a longest common subsequence technique to the data frame for providing a modification type.
17. The method according to claim 16, wherein applying (S106AB) the longest common subsequence technique to the data frame for providing the modification type comprises: determining (S106ABA), based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type indicative of a type of modification between the respective token of the first data and the corresponding token of the second data; and including (S106ABB) the modification type in the modification data.
18. The method according to claim 17, wherein determining (S106ABA), based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises iteratively: determining (S106ABAA) a first longest common subsequence between the respective token of the first data and the corresponding token of the second data; performing (S106ABAB) a correction in the respective token of the first data, wherein the correction is one or more of: a shift, an alteration, an addition, and a deletion; determining (S106ABAC) a second longest common subsequence between the respective corrected token of the first data and the corresponding token of the second data; comparing (S106ABAD) the first longest common subsequence with the second longest common subsequence; and determining (S106ABAE) whether the comparison meets a second criterion.
19. The method according to claim 18, wherein determining (S106ABA), based on the longest common subsequence technique between the respective token of the first data and the corresponding token of the second data, the modification type comprises: upon the comparison meeting the second criterion, determining (S106ABAF) the modification type based on one or more corrections.
20. The method according to any of claims 16-19, wherein the modification type comprises one or more of: a deletion type, an addition type, and an alteration type.
21. The method according to any of the previous claims, wherein providing (S108) the output comprises displaying (S108A), on a display device, a user interface object representative of the output.
22. An electronic device comprising a memory, a processor, and an interface, wherein the electronic device is configured to perform any of the methods according to claims 1-21.
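As a non-limiting illustration of the longest common subsequence technique referred to in claims 16-20, the following Python sketch computes a longest common subsequence between a token of the first data and the corresponding token of the second data and derives a modification type from it. It is a simplified sketch under stated assumptions and does not reproduce the iterative correction procedure of claim 18: the classification rule, the function names `longest_common_subsequence` and `modification_type`, and the extra label "none" for identical tokens are all assumptions made for this example.

```python
def longest_common_subsequence(a: str, b: str) -> str:
    """Return one longest common subsequence of a and b (dynamic programming)."""
    # table[i][j] holds the LCS length of a[:i] and b[:j].
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ch_a in enumerate(a, start=1):
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk back through the table to recover the characters of one LCS.
    chars = []
    i, j = len(a), len(b)
    while i > 0 and j > 0:
        if a[i - 1] == b[j - 1]:
            chars.append(a[i - 1])
            i -= 1
            j -= 1
        elif table[i - 1][j] >= table[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(chars))


def modification_type(first_token: str, second_token: str) -> str:
    """Classify how the first-document token differs from the second-document token.

    Assumed rule: if the whole second token survives inside the first token,
    characters were added; if the whole first token is contained in the second
    token, characters were deleted; otherwise the token was altered.
    """
    if first_token == second_token:
        return "none"  # assumed label for unmodified tokens
    common = len(longest_common_subsequence(first_token, second_token))
    if common == len(second_token):
        return "addition"
    if common == len(first_token):
        return "deletion"
    return "alteration"


if __name__ == "__main__":
    pairs = [("1000", "100"), ("10", "100"), ("USD", "EUR"), ("100", "100")]
    for first_token, second_token in pairs:
        print(first_token, second_token, modification_type(first_token, second_token))
```

In this sketch the first token is taken from the first document (for example the signed copy) and the second token from the second document (for example the original), so "addition" means the first document contains characters that the second document lacks.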
PCT/EP2023/074289 2022-09-14 2023-09-05 A method for detection of modification in a first document and related electronic device WO2024056457A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
DKPA202270444 2022-09-14

Publications (1)

Publication Number Publication Date
WO2024056457A1 (en) 2024-03-21

Family

ID=87933899

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2023/074289 WO2024056457A1 (en) 2022-09-14 2023-09-05 A method for detection of modification in a first document and related electronic device

Country Status (1)

Country Link
WO (1) WO2024056457A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090028392A1 (en) * 2007-07-23 2009-01-29 Sitaram Ramachandrula Document Comparison Method And Apparatus
US20120072859A1 (en) * 2008-06-02 2012-03-22 Pricewaterhousecoopers Llp System and method for comparing and reviewing documents
US20170286767A1 (en) * 2013-12-18 2017-10-05 Abbyy Development Llc Method and apparatus for finding differences in documents

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23765249

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE