WO2020100018A1

WO2020100018A1 - A system and method for artificial intelligence-based proof reader for documents

Info

Publication number: WO2020100018A1
Application number: PCT/IB2019/059690
Authority: WO
Inventors: Sushma BHAT
Original assignee: Individual
Current assignee: Individual
Priority date: 2018-11-15
Filing date: 2019-11-12
Publication date: 2020-05-22
Anticipated expiration: 2021-05-15

Abstract

A system for artificial intelligence-based proof reader for documents is disclosed. The system includes a machine learning module including a machine learning classifier and configured to receive a digital document and to identify at least one of one or more positive sentences and one or more negative sentences present in the digital document. The system also includes a shallow parser module configured to receive the one or more negative sentences from the machine learning module. The shallow parser is also configured to apply a set of predetermined rules to the one or more negative sentences to extract one or more positive texts in the one or more negative sentences. The shallow parser module is also configured to filter the one or more positive texts corresponding to a set of predefined patterns. The shallow parser module is also configured to highlight the filtered one or more positive texts.

Description

A SYSTEM AND METHOD FOR ARTIFICIAL INTELLIGENCE-BASED PROOF READER FOR DOCUMENTS

This International Application claims priority from a Complete patent application filed in India having Patent Application No. 201841043008, filed on November 15, 2018 and titled “A SYSTEM AND METHOD FOR ARTIFICIAL INTELLIGENCE-BASED PROOF READER FOR DOCUMENTS”.

BACKGROUND

Embodiments of the present disclosure relate to proof reading documents, and more particularly to, a system and method for artificial intelligence-based proof reader for documents.

Companies, especially those working with financial institutions or financial companies, usually release a research report at the end of a financial year or on a regular basis while abiding by reporting standards. News agencies in the business of collecting, collating and publishing news also need to cross-check the veracity of news content before it is released for publication.

The research reports are usually prepared by a research analyst, and sometimes there might be errors, especially those which might jeopardise the integrity of a company or wrongly influence the readers. Similarly, the news agencies nowadays suffer from the circulation of fake news and inflammatory news which may disrupt the public law and order.

In a few scenarios, the research reports are manually checked by a designated expert of a company for inconsistencies and regulatory constraints and for the presence of any conflicting text or speculative text, which may jeopardise the company's integrity and may result in the company paying huge fines to the government or a higher authority. This slows down the release of financial reports and, despite the rigorous checking, there are always errors that go unnoticed. The existing automated systems are found to be inefficient in solving the aforementioned issues affecting news agencies, financial institutions and companies.

Therefore, there is a need for a system which can scrutinise a given document and alert the user to the presence of any potential errors.

BRIEF DESCRIPTION

This summary is provided to introduce a selection of concepts, in a simple manner, which is further described in the detailed description of the invention. This summary is neither intended to identify key or essential inventive concepts of the subject matter nor to determine the scope of the invention.

In accordance with the present disclosure, a system for artificial intelligence-based proof reader for documents is disclosed. The system includes a machine learning module including a machine learning classifier and configured to receive a digital document. The machine learning classifier is also configured to identify at least one of one or more positive sentences and one or more negative sentences present in the digital document. The system also includes a shallow parser module configured to receive the one or more negative sentences from the machine learning module. The shallow parser is also configured to apply a set of predetermined rules to the one or more negative sentences to extract one or more positive texts in the one or more negative sentences. The shallow parser module is also configured to filter the one or more positive texts corresponding to a set of predefined patterns. The shallow parser module is also configured to highlight the filtered one or more positive texts.

In accordance with another embodiment of the present disclosure a method for artificial intelligence-based proof reader for documents is disclosed. The method includes receiving, by a machine learning module, a digital document. The method also includes identifying, by the machine learning module, at least one of one or more positive sentences and one or more negative sentences present in the digital document. The method also includes receiving by a shallow parser, the one or more negative sentences from the machine learning module. The method also includes applying, by the shallow parser, a set of predetermined rules to the one or more negative sentences to extract the one or more positive texts in the one or more negative sentences. The method also includes filtering, by the shallow parser, the validated one or more positive texts corresponding to a set of predefined patterns. The method also includes highlighting, by the shallow parser, the filtered one or more positive texts.

To further clarify the advantages and features of the present invention, a more particular description of the invention will follow by reference to specific embodiments thereof, which are illustrated in the appended figures. It is to be appreciated that these figures depict only typical embodiments of the invention and are therefore not to be considered limiting in scope. The invention will be described and explained with additional specificity and detail with the appended figures.

BRIEF DESCRIPTION OF DRAWINGS

The disclosure will be described and explained with additional specificity and detail with the accompanying figures in which:

FIG. 1 illustrates a block diagram of a system for artificial intelligence-based proof reader for documents in accordance with an embodiment of the present disclosure;

FIG. 2 is a schematic representation of an exemplary system of the system for artificial intelligence-based proof reader for documents of FIG. 1 in accordance with an embodiment of the present disclosure; and

FIG. 3 illustrates a flowchart representing the steps involved in a method for artificial intelligence-based proof reader for documents in accordance with an embodiment of the present disclosure.

Further, those skilled in the art will appreciate that elements in the figures are illustrated for simplicity and may not have necessarily been drawn to scale. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the figures by conventional symbols, and the figures may show only those specific details that are pertinent to understanding the embodiments of the present invention so as not to obscure the figures with details that will be readily apparent to those skilled in the art having the benefit of the description herein.

DETAILED DESCRIPTION OF THE INVENTION

For the purpose of promoting an understanding of the principles of the invention, reference will now be made to the embodiment illustrated in the figures and specific language will be used to describe them. It will nevertheless be understood that no limitation of the scope of the invention is thereby intended. Such alterations and further modifications in the illustrated system, and such further applications of the principles of the invention as would normally occur to those skilled in the art are to be construed as being within the scope of the present invention. It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the invention and are not intended to be restrictive thereof.

The terms "comprises", "comprising", or any other variations thereof, are intended to cover a non-exclusive inclusion, such that one or more devices or sub-systems or elements or structures or components preceded by "comprises... a" does not, without more constraints, preclude the existence of other devices, sub-systems, elements, structures, components, additional devices, additional sub-systems, additional elements, additional structures or additional components. Appearances of the phrase "in an embodiment", "in another embodiment" and similar language throughout this specification may, but not necessarily do, all refer to the same embodiment.

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art to which this invention belongs. The system, methods, and examples provided herein are only illustrative and not intended to be limiting.

Embodiments of the present disclosure relate to a system for artificial intelligence-based proof reader for documents. The system includes a machine learning module including a machine learning classifier and configured to receive a digital document. The machine learning classifier is also configured to identify at least one of one or more positive sentences and one or more negative sentences present in the digital document. The system also includes a shallow parser module configured to receive the one or more negative sentences from the machine learning module. The shallow parser module is also configured to apply a set of predetermined rules to the one or more negative sentences to extract one or more positive texts in the one or more negative sentences. The shallow parser module is also configured to filter the one or more positive texts corresponding to a set of predefined patterns. The shallow parser module is also configured to highlight the filtered one or more positive texts.

FIG. 1 is a block diagram of a system (100) for artificial intelligence-based proof reader for documents in accordance with an embodiment of the present disclosure. The system (100) includes a machine learning module (110), wherein the machine learning module (110) includes a machine learning classifier (120). In such embodiment, the machine learning classifier (120) may be based on supervised learning. In one embodiment, the machine learning classifier (120) may include a binary classifier. In such embodiment, the binary classifier may include a logistic regression, a support vector machine (SVM) classifier, a neural network, a k-nearest neighbour (kNN) classifier or a Naive Bayes classifier.
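By way of a non-limiting illustration, one of the binary classifiers listed above, a Naive Bayes classifier, may be sketched as follows. The class name, the whitespace tokenisation, the add-one smoothing and the toy training sentences are assumptions made purely for illustration and are not taken from the disclosure.

```python
import math
from collections import Counter


def tokenize(sentence):
    """Lowercase whitespace tokenisation (an illustrative simplification)."""
    return sentence.lower().split()


class NaiveBayesSentenceClassifier:
    """Multinomial Naive Bayes with add-one smoothing.

    Label 1 stands for a positive sentence (text with issues),
    label 0 for a negative sentence (text without issues).
    """

    def fit(self, sentences, labels):
        self.word_counts = {0: Counter(), 1: Counter()}
        self.class_counts = Counter(labels)
        for sentence, label in zip(sentences, labels):
            self.word_counts[label].update(tokenize(sentence))
        self.vocab = set(self.word_counts[0]) | set(self.word_counts[1])
        return self

    def _log_score(self, sentence, label):
        total_words = sum(self.word_counts[label].values())
        total_docs = sum(self.class_counts.values())
        score = math.log(self.class_counts[label] / total_docs)
        for word in tokenize(sentence):
            if word in self.vocab:  # words never seen in training are ignored
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(self.vocab)))
        return score

    def predict(self, sentence):
        return 1 if self._log_score(sentence, 1) > self._log_score(sentence, 0) else 0


# Hypothetical training data, invented for this sketch.
train_sentences = [
    "the price of oil will go up",
    "the stock will double next year",
    "we expect the price of oil to rise",
    "we believe revenue may grow",
]
train_labels = [1, 1, 0, 0]
clf = NaiveBayesSentenceClassifier().fit(train_sentences, train_labels)
```

With this toy data, `clf.predict("the benchmark will go up")` yields the positive label and `clf.predict("we expect revenue may rise")` yields the negative label. Any of the other listed classifiers could take the place of this one.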

The machine learning classifier (120) is configured to receive a digital document. The machine learning classifier (120) evaluates the received digital document based on a historical report. In one embodiment, the received digital document may include a financial document or a research report. In such embodiment, the financial document may be prepared by research analysts or financial brokers. The historical report is created manually in an offline environment by research analysts and reviewed by supervisory analysts. The historical report includes review tracker changes along with a published document. The review tracker changes along with the published document provide the required sample data for training the binary classifier to identify positive and negative parts of sentences. The sample data obtained from corrected text in the review tracker changes is categorized as positive samples, whereas the sample data obtained from published text is categorized as negative samples. The positive and negative samples of the historical reports become the labels or target values used in prediction. The received digital document is evaluated based on the labels of the historical reports by generation of a machine learning model. The generated machine learning model is represented as Y = f(x), wherein ‘Y’ represents an output of a process, ‘x’ represents an input of the process and ‘f’ represents a function of the variable x.
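By way of a non-limiting illustration, the labelling of the sample data described above may be sketched as follows. The `HistoricalReport` container, its field names and the example sentences are hypothetical; the sketch only shows text from the review tracker changes being labelled positive and text from the published document being labelled negative, giving the inputs x and targets Y of the model Y = f(x).

```python
from dataclasses import dataclass, field

POSITIVE, NEGATIVE = 1, 0


@dataclass
class HistoricalReport:
    """Hypothetical container for one manually reviewed report."""
    tracker_change_texts: list = field(default_factory=list)  # text corrected in the review tracker
    published_texts: list = field(default_factory=list)       # text of the published document


def build_samples(reports):
    """Return parallel lists x (sentences) and y (labels) for training."""
    x, y = [], []
    for report in reports:
        for text in report.tracker_change_texts:
            x.append(text)
            y.append(POSITIVE)
        for text in report.published_texts:
            x.append(text)
            y.append(NEGATIVE)
    return x, y


# Invented example data.
reports = [
    HistoricalReport(
        tracker_change_texts=["The stock will double next year"],
        published_texts=["We believe the stock may appreciate next year"],
    )
]
x, y = build_samples(reports)
```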

The machine learning classifier (120) is also configured to identify at least one of one or more positive sentences and one or more negative sentences present in the received digital document. The sentences of the received digital document, or the dataset, are used for training and classification with the generated machine learning model by splitting the dataset into a training set and a testing set. The training set is the subset used to train the generated model and the testing set is the subset used to test the generated model. The training set and the testing set are created by splitting the dataset based on a predefined split ratio. After classification, the sentences of the received digital document are categorised into the one or more positive sentences and the one or more negative sentences. In one embodiment, the one or more positive sentences include a promissory text, a political text, an inflated text, a fact without source, a conflicting text and a speculative text. In such embodiment, the one or more positive texts may represent an incorrect text. In another embodiment, the one or more negative sentences include a non-conflicting text and a non-speculative text.
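By way of a non-limiting illustration, splitting a dataset into a training set and a testing set based on a predefined split ratio may be sketched as follows. The function name, the shuffling step and the fixed random seed are assumptions of this sketch; the default ratio of 0.7 mirrors the 70:30 example given for FIG. 2.

```python
import random


def split_dataset(samples, train_ratio=0.7, seed=0):
    """Shuffle the samples and split them into (training set, testing set)."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must lie strictly between 0 and 1")
    shuffled = list(samples)               # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    cut = int(round(len(shuffled) * train_ratio))
    return shuffled[:cut], shuffled[cut:]


# Ten placeholder samples standing in for labelled sentences.
dataset = [f"sentence {i}" for i in range(10)]
training_set, testing_set = split_dataset(dataset, train_ratio=0.7)
```

With ten samples and a ratio of 0.7, the training set holds seven samples and the testing set holds the remaining three.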

The one or more positive sentences and the one or more negative sentences classified by the machine learning classifier (120) are evaluated by using a precision and a recall technique. As used herein, the term ‘precision’ is defined as the fraction of retrieved samples which are relevant. The precision is the number of true positive samples divided by the sum of the number of true positive samples and the number of false positive samples. As used herein, the term ‘recall’ is defined as the fraction of relevant samples which are retrieved. Similarly, the recall is the number of true positive samples divided by the sum of the number of true positive samples and the number of false negative samples. In one embodiment, the true positive samples are the positive samples which are correctly classified as positive. In another embodiment, the false positive samples are the samples of a test result which wrongly indicates that a particular condition or an attribute is present. In yet another embodiment, the false negative samples are the samples of the test result which wrongly indicates that the particular condition or attribute is absent. The machine learning model is configured to have a lower review threshold, so that no incorrect text goes amiss. The machine learning model hence will have a lower precision value but a higher recall value.
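By way of a non-limiting illustration, the precision and recall computation and the effect of a lower review threshold may be sketched as follows. The function names, the classifier scores and the two threshold values are invented for this sketch.

```python
def precision_recall(y_true, y_pred):
    """Compute (precision, recall) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def classify(scores, threshold):
    """Label a sample positive when its score reaches the review threshold."""
    return [1 if score >= threshold else 0 for score in scores]


# Invented ground truth and classifier scores for six sentences.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.6, 0.35, 0.4, 0.2, 0.1]

higher_threshold = precision_recall(y_true, classify(scores, 0.5))
lower_threshold = precision_recall(y_true, classify(scores, 0.3))
```

On this toy data the threshold of 0.5 gives a precision of 1.0 and a recall of 2/3, whereas the threshold of 0.3 gives a precision of 0.75 and a recall of 1.0, which is the trade-off described above: lowering the threshold lets fewer incorrect texts slip through at the cost of more false positives.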

The system (100) also includes a shallow parser module (130) configured to receive the one or more negative sentences from the machine learning module (110). In one embodiment, the shallow parser module (130) identifies text which is based on language construct. In such embodiment, the shallow parser module (130) analyses a sentence, identifies constituent parts of the sentence such as nouns, verbs, adjectives, determiners, modals or the like and then links such parts of the sentence to units with discrete grammatical meanings such as noun groups or phrases or verb groups.
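By way of a non-limiting illustration, identifying the constituent parts of a sentence and linking them into groups may be sketched as follows. The tiny lexicon, the tag-to-group mapping and the fallback tag are hypothetical stand-ins for a real part-of-speech tagger and are used only to keep the sketch self-contained.

```python
from itertools import groupby

# Hypothetical mini-lexicon mapping words to part-of-speech tags.
LEXICON = {
    "the": "DT", "price": "NN", "of": "IN", "oil": "NN",
    "will": "MD", "go": "VB", "up": "RP",
}

# Hypothetical mapping from tags to the unit they belong to.
GROUP_OF_TAG = {
    "DT": "noun group", "NN": "noun group", "JJ": "noun group",
    "MD": "verb group", "VB": "verb group", "RP": "verb group",
}


def tag(sentence):
    """Tag each word using the lexicon; unknown words default to NN."""
    return [(word, LEXICON.get(word.lower(), "NN")) for word in sentence.split()]


def shallow_parse(tagged):
    """Link consecutive words that map to the same unit into one group."""
    groups = []
    for label, members in groupby(tagged, key=lambda wt: GROUP_OF_TAG.get(wt[1], "other")):
        groups.append((label, [word for word, _ in members]))
    return groups


parsed = shallow_parse(tag("The price of oil will go up"))
```

For the example sentence this yields a noun group ‘The price’, the preposition ‘of’, a noun group ‘oil’ and a verb group ‘will go up’.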

The shallow parser module (130) is also configured to apply a set of predetermined rules for further validation to the one or more negative sentences to extract one or more positive texts in the one or more negative sentences. In one embodiment, the set of predetermined rules may include one or more chunking rules. In such embodiment, the one or more chunking rules may be configured to identify the one or more positive texts and the one or more negative texts. In one embodiment, the chunking rule may be defined for identifying the one or more positive texts from a sentence such as ‘The price of oil will go up’. In such embodiment, the chunking rule for identifying the one or more positive texts may include a rule, which is represented as (<DT>? <NN.*> <IN> <NN.*> <MD> <VB> <RP>?), wherein DT represents a determiner, NN represents a noun which may be singular or plural, IN represents a preposition or subordinating conjunction, MD represents a modal, VB represents a verb and RP represents a particle.

In another embodiment, the chunking rule may be defined for identifying the one or more positive texts from a sentence such as ‘Company A and Company B will merge and the benchmark to go up by 10%’. In such embodiment, the chunking rule for identifying the one or more positive texts may include a rule, which is represented as (<NN.*>* <MD> <RB>? <VB>), wherein NN represents a noun which may be singular or plural, MD represents a modal, RB represents an adverb and VB represents a verb.

In yet another embodiment, the chunking rule may be defined for identifying the one or more positive texts from a sentence such as ‘The market is overly optimistic on its future growth’. In such embodiment, the chunking rule may include a rule, which is represented as (<NN> <VBZ> <JJ> <IN>) | (<NN> <VBZ> <RB> <JJ> <IN>), wherein NN represents a noun which may be singular or plural, VBZ represents a verb in the 3rd person singular present, JJ represents an adjective, RB represents an adverb and IN represents a preposition or subordinating conjunction.
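By way of a non-limiting illustration, applying the three chunking rules above to sentences that have already been tagged may be sketched as follows. The helper names, the conversion of a tag pattern into a regular expression and the hand-assigned tags are assumptions of this sketch; the disclosure does not specify how the rules are evaluated.

```python
import re


def tag_pattern_to_regex(tag_pattern):
    """Turn a tag pattern such as '<DT>? <NN.*> <MD>' into a regex over '<TAG><TAG>...'."""
    compact = re.sub(r"\s+", "", tag_pattern)
    # Wrap each <...> so that '?' and '*' apply to the whole tag.
    grouped = re.sub(r"(<[^>]+>)", r"(?:\1)", compact)
    # Inside a tag, '.' must not run across tag boundaries.
    return re.compile(grouped.replace(".", "[^<>]"))


def find_chunks(tagged_tokens, tag_pattern):
    """Return the runs of (word, tag) pairs whose tag sequence matches the pattern."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    chunks = []
    for match in tag_pattern_to_regex(tag_pattern).finditer(tag_string):
        if match.end() == match.start():
            continue  # skip empty matches
        first = tag_string.count("<", 0, match.start())  # index of first token
        last = tag_string.count("<", 0, match.end())     # one past the last token
        chunks.append(tagged_tokens[first:last])
    return chunks


def words(chunk):
    return " ".join(word for word, _ in chunk)


RULE_1 = "(<DT>? <NN.*> <IN> <NN.*> <MD> <VB> <RP>?)"
RULE_2 = "(<NN.*>* <MD> <RB>? <VB>)"
RULE_3 = "(<NN> <VBZ> <JJ> <IN>) | (<NN> <VBZ> <RB> <JJ> <IN>)"

# Hand-tagged example sentences (the tags are assigned manually for this sketch).
oil = [("The", "DT"), ("price", "NN"), ("of", "IN"), ("oil", "NN"),
       ("will", "MD"), ("go", "VB"), ("up", "RP")]
merger = [("Company", "NNP"), ("A", "NNP"), ("and", "CC"), ("Company", "NNP"),
          ("B", "NNP"), ("will", "MD"), ("merge", "VB")]
market = [("The", "DT"), ("market", "NN"), ("is", "VBZ"), ("overly", "RB"),
          ("optimistic", "JJ"), ("on", "IN"), ("its", "PRP$"),
          ("future", "JJ"), ("growth", "NN")]

oil_chunks = [words(c) for c in find_chunks(oil, RULE_1)]
merger_chunks = [words(c) for c in find_chunks(merger, RULE_2)]
market_chunks = [words(c) for c in find_chunks(market, RULE_3)]
```

Under these assumptions the first rule matches the whole of ‘The price of oil will go up’, the second matches ‘Company B will merge’ and the third matches ‘market is overly optimistic on’.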

The shallow parser module (130) is also configured to filter the one or more positive texts corresponding to a set of predefined patterns. The shallow parser module (130) upon matching the sentence with the corresponding chunking rule filters the one or more positive texts for further validation, corresponding to the set of predefined patterns. In one embodiment, the set of predefined patterns may include at least one of one or more permitted patterns and at least one of one or more non-permitted patterns. In such embodiment, the at least one of one or more permitted patterns may include phrases such as we expect, we believe, maintain, may or could. In another embodiment, the at least one of one or more non-permitted patterns may include phrases such as will or should.
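By way of a non-limiting illustration, filtering against permitted and non-permitted patterns may be sketched as follows. The function names and the whole-word, case-insensitive matching are assumptions of this sketch. Letting a permitted phrase take precedence over a non-permitted phrase in the same sentence is an interpretation based on the ‘We expect’ example given for FIG. 2.

```python
import re

PERMITTED = ("we expect", "we believe", "maintain", "may", "could")
NON_PERMITTED = ("will", "should")


def contains_phrase(text, phrase):
    """Case-insensitive whole-word search for a phrase."""
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def filter_positive_texts(sentence):
    """Return the non-permitted phrases to flag, or an empty list if none apply."""
    if any(contains_phrase(sentence, phrase) for phrase in PERMITTED):
        return []  # assumed precedence: a permitted phrase hedges the sentence
    return [phrase for phrase in NON_PERMITTED if contains_phrase(sentence, phrase)]


flagged = filter_positive_texts("The price of oil will go up")
hedged = filter_positive_texts("We expect, the price of oil will go up")
```

Here `flagged` contains ‘will’, while `hedged` is empty because the sentence carries the permitted phrase ‘we expect’.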

The shallow parser module (130) is also configured to highlight the filtered one or more positive texts. The filtered one or more positive texts are the texts with issues. In one embodiment, the texts with issues may include the conflicting text or the speculative text.
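By way of a non-limiting illustration, highlighting the filtered positive texts may be sketched as follows. The disclosure does not specify how the highlight is rendered, so the double square brackets used as markers and the function name are assumptions of this sketch.

```python
import re


def highlight(sentence, phrases, open_mark="[[", close_mark="]]"):
    """Wrap every whole-word occurrence of the given phrases in marker strings."""
    if not phrases:
        return sentence
    # Longest phrases first so that a longer phrase wins over a shorter one it contains.
    ordered = sorted(phrases, key=len, reverse=True)
    alternatives = "|".join(re.escape(phrase) for phrase in ordered)
    pattern = re.compile(r"\b(" + alternatives + r")\b", flags=re.IGNORECASE)
    return pattern.sub(lambda m: open_mark + m.group(1) + close_mark, sentence)


marked = highlight("The price of oil will go up", ["will"])
```

For the example sentence, `marked` is ‘The price of oil [[will]] go up’.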

FIG. 2 is a schematic representation of an exemplary system (200) of the system (100) for artificial intelligence-based proof reader for documents of FIG. 1 in accordance with an embodiment of the present disclosure. One or more modules of the system of FIG. 2 are substantially similar to the one or more modules of the system of FIG. 1. The system of FIG. 2 uses a hybrid approach which includes a machine learning approach as well as a shallow parser for identifying conflicting texts. In the hybrid approach, an offline process includes creating a machine learning model based on a historical report. The historical report is created manually in an offline environment by research analysts and reviewed by supervisory analysts.

The historical report includes review tracker changes along with a published document. The review tracker changes along with the published document provide the required sample data for training the binary classifier to identify positive and negative parts of sentences. The sample data obtained from corrected text or the published text is categorized as positive samples, whereas the sample data obtained after the review tracker changes or corrections made to the sample data is categorized as negative samples. The positive and negative samples of the historical reports become the labels or target values used in prediction. Also, a shallow parser rules database is created manually in the offline process to receive one or more negative sentences. The received one or more negative sentences are further validated by manually creating a set of permitted phrases and a set of non-permitted phrases.

Similarly, a real-time process of the hybrid approach includes receiving and reading a digital document by a machine learning module. The received digital document, for example a financial document of an organisation, is read and evaluated by a binary machine learning classifier such as a support vector machine (SVM) based on the historical report. The SVM classifier classifies one or more sentences of the financial document into one or more positive sentences or one or more negative sentences. The sentences of the financial document, which form a dataset for the machine learning model, are split into a training set and a testing set based on a predefined split ratio. Here, the split ratio considered for the classification is 70:30, wherein 70 percent of the dataset is the training set and 30 percent of the dataset is the testing set. For example, the one or more positive sentences are a promissory text, a political text, an inflated text, a fact without source, a conflicting text or a speculative text.

Similarly, the one or more negative sentences include a non-conflicting text and a non-speculative text. The one or more positive sentences and the one or more negative sentences after the classification are evaluated by using a precision and a recall technique. The precision here measures the exactness of the predictions, that is, how many of the selected samples were correctly predicted. The recall, in contrast, measures how many of the samples that should have been selected were actually selected. For example, the recall identifies the number of relevant samples which are retrieved correctly. The machine learning model needs to have a lower precision value but a higher recall value in order to ensure that no review content or incorrect text is missed and goes into the published document. The shallow parser, in contrast, has a higher precision value and a lower recall value.

The one or more negative sentences after the classification are passed to the shallow parser for identifying text based on language construct. The shallow parser analyses the one or more sentences and parses the sentences based on parts of speech (POS) tagging. The shallow parser analyses a sentence, identifies constituent parts of the sentence such as nouns, verbs, adjectives, determiners, modals or the like and then links such parts of the sentence to units with discrete grammatical meanings such as noun groups or phrases or verb groups. For example, a negative sentence may be ‘The price of oil will go up’.

The shallow parser also applies a set of chunking rules to the abovementioned negative sentence for further validation to extract one or more positive texts. The set of chunking rules includes the chunking rule for identifying the one or more positive texts and the chunking rule for identifying the one or more negative texts. For example, if the abovementioned sentence ‘The price of oil will go up’ is considered, then the chunking rule for such sentence may be defined as (<DT>? <NN.*> <IN> <NN.*> <MD> <VB> <RP>?), wherein DT represents a determiner, NN represents a noun which may be singular or plural, IN represents a preposition or subordinating conjunction, MD represents a modal, VB represents a verb and RP represents a particle. So, here the positive text is ‘will’.

The positive text is then filtered again by the shallow parser for further validation corresponding to a set of predefined patterns. In one embodiment, the set of predefined patterns may include at least one of one or more permitted patterns and at least one of one or more non-permitted patterns. In such embodiment, the at least one of one or more permitted patterns may include phrases such as we expect, we believe, maintain, may or could. In another embodiment, the at least one of one or more non-permitted patterns may include phrases such as will or should. After the shallow parser filters the positive text corresponding to the set of predefined patterns, the positive text, for example ‘will’, gets highlighted and one or more suggestions for permitting the sentence may be shown, such as ‘We expect, the price of oil will go up’. Here, ‘we expect’ is the permitted pattern and the corresponding sentence becomes a negative sentence, that is, a sentence without issues or a non-conflicting text, which will be discarded when passed to the shallow parser again.
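By way of a non-limiting illustration, generating the suggestion for the example above and checking that the suggested sentence is no longer flagged may be sketched as follows. The function names, the default prefix and the rule that lowercases only a leading article are assumptions of this sketch; the disclosure gives only the single ‘We expect’ example.

```python
import re

PERMITTED = ("we expect", "we believe", "maintain", "may", "could")
NON_PERMITTED = ("will", "should")
LEADING_WORDS_TO_LOWERCASE = {"The", "A", "An"}  # naive: avoids lowercasing proper nouns


def has_phrase(text, phrases):
    """True if any phrase occurs as whole words, ignoring case."""
    return any(
        re.search(r"\b" + re.escape(p) + r"\b", text, flags=re.IGNORECASE)
        for p in phrases
    )


def is_flagged(sentence):
    """A sentence is flagged when it has a non-permitted phrase and no permitted one."""
    return has_phrase(sentence, NON_PERMITTED) and not has_phrase(sentence, PERMITTED)


def suggest_rewrite(sentence, permitted_prefix="We expect"):
    """Prefix the sentence with a permitted phrase, as in the example above."""
    first, _, rest = sentence.partition(" ")
    if first in LEADING_WORDS_TO_LOWERCASE:
        first = first.lower()
    body = first + (" " + rest if rest else "")
    return f"{permitted_prefix}, {body}"


original = "The price of oil will go up"
suggestion = suggest_rewrite(original)
```

Here `suggestion` is ‘We expect, the price of oil will go up’; `is_flagged(original)` is true, whereas `is_flagged(suggestion)` is false, so the suggested sentence would be discarded on a second pass.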

FIG. 3 illustrates a flowchart representing the steps involved in a method (300) for artificial intelligence-based proof reader for documents in accordance with an embodiment of the present disclosure. The method (300) includes receiving, by a machine learning module, a digital document in step 310. In one embodiment, the method (300) includes receiving by the machine learning module the digital document such as a financial report or a research report. The machine learning module by receiving the digital document is configured to evaluate received digital document based on a historical report.

The method (300) also includes identifying, by the machine learning module, at least one of one or more positive sentences and one or more negative sentences present in the digital document in step 320. In one embodiment, identifying by the machine learning module the at least one of one or more positive sentences and the one or more negative sentences may include identifying or classifying by a machine learning classifier the at least one of one or more positive sentences or the at least one of one or more negative sentences. Classification or identification of the at least one of one or more positive sentences and the at least one of one or more negative sentences of the received digital document includes training one or more sentences of the received digital document or a dataset and splitting the dataset into a training set and a testing set based on a split ratio, by a generated machine learning model.

In one embodiment, the one or more positive sentences include a promissory text, a political text, an inflated text, a fact without source, a conflicting text and a speculative text. In such embodiment, the one or more positive texts may represent an incorrect text. In another embodiment, the one or more negative sentences include a non-conflicting text and a non-speculative text. The one or more positive sentences and the one or more negative sentences classified by the machine learning classifier are evaluated by using a precision and a recall technique. In one embodiment, the precision is the fraction of relevant instances among the retrieved instances.

The method (300) also includes receiving by a shallow parser, the one or more negative sentences from the machine learning module in step 330. In one embodiment, receiving by the shallow parser, the one or more negative sentences includes obtaining the one or more negative sentences from the machine learning module and identifying text based on language construct. In such embodiment, identifying the text may include analysing a sentence, identifying constituent parts of the sentence such as nouns, verbs, adjectives, determiners, modals or the like and linking such parts of the sentence to units with discrete grammatical meanings such as noun groups, or phrases or verb groups.

The method (300) also includes applying, by the shallow parser, a set of predetermined rules to the one or more negative sentences to extract the one or more positive texts in the one or more negative sentences in step 340. In one embodiment, applying by the shallow parser, the set of predetermined rules to the one or more negative sentences to extract the one or more positive texts in the one or more negative sentences may include applying the set of predetermined rules such as one or more chunking rules. In such embodiment, the one or more chunking rules may be configured to identify one or more positive texts and the one or more negative texts.

The method (300) also includes filtering, by the shallow parser, the validated one or more positive texts corresponding to a set of predefined patterns in step 350. In one embodiment, filtering, by the shallow parser, the validated one or more positive texts corresponding to the set of predefined patterns may include filtering the one or more positive texts for further validation corresponding to the set of predefined patterns such as at least one of one or more permitted patterns or at least one of one or more non-permitted patterns. In such embodiment, the at least one of one or more permitted patterns may include phrases such as we expect, we believe, maintain, may or could. In another embodiment, the at least one of one or more non-permitted patterns may include phrases such as will or should.

The method (300) also includes highlighting, by the shallow parser, the filtered one or more positive texts in step 360. In one embodiment, highlighting by the shallow parser the filtered one or more positive texts may include highlighting the texts with issues. In such embodiment, the texts with issues may include the conflicting text or the speculative text.

Various embodiments of the present disclosure enable identifying conflicting text within the digital document that can lead to potential litigation in the future from institutional investors and retail clients by using a hybrid approach of machine learning as well as a shallow parser. Moreover, the presently disclosed system provides a solution which allows the digital document to be verified for issues based on the entire context rather than using pattern matching techniques.
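By way of a non-limiting illustration, the flow of steps 310 to 360 of the method (300) may be sketched as follows. The `proofread` function and the three stand-in callables are hypothetical; in particular, the keyword-based stand-ins merely take the place of the trained classifier, the chunking rules and the pattern filter so that the sketch runs on its own.

```python
def proofread(sentences, classify, extract_texts, filter_texts):
    """Return a mapping from each sentence with issues to the texts to highlight."""
    flagged = {}
    for sentence in sentences:                      # step 310: receive the document
        if classify(sentence) == 1:                 # step 320: positive sentence
            flagged[sentence] = [sentence]
            continue
        candidates = extract_texts(sentence)        # steps 330 and 340: apply rules
        kept = filter_texts(sentence, candidates)   # step 350: filter by patterns
        if kept:
            flagged[sentence] = kept                # step 360: texts to highlight
    return flagged


# Hypothetical stand-ins, invented for this sketch.
def toy_classify(sentence):
    return 1 if "guarantee" in sentence.lower() else 0


def toy_extract_texts(sentence):
    modals = {"will", "should", "may", "could"}
    return [word for word in sentence.split() if word.lower() in modals]


def toy_filter_texts(sentence, candidates):
    if sentence.lower().startswith(("we expect", "we believe")):
        return []
    return [word for word in candidates if word.lower() in {"will", "should"}]


document = [
    "We guarantee a 20% return",
    "The price of oil will go up",
    "We expect, the price of oil will go up",
    "Revenue was flat",
]
result = proofread(document, toy_classify, toy_extract_texts, toy_filter_texts)
```

In this toy run the first sentence is flagged as a whole by the classifier stand-in, ‘will’ is flagged in the second sentence, and the last two sentences pass without issues.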

Furthermore, the presently disclosed system aids the supervisory analysts, reduces human error and, as a result, enhances productivity during research report review. It will be understood by those skilled in the art that the foregoing general description and the following detailed description are exemplary and explanatory of the disclosure and are not intended to be restrictive thereof.

While specific language has been used to describe the disclosure, any limitations arising on account of the same are not intended. As would be apparent to a person skilled in the art, various working modifications may be made to the method in order to implement the inventive concept as taught herein.

The figures and the foregoing description give examples of embodiments. Those skilled in the art will appreciate that one or more of the described elements may well be combined into a single functional element. Alternatively, certain elements may be split into multiple functional elements. Elements from one embodiment may be added to another embodiment. For example, the order of processes described herein may be changed and are not limited to the manner described herein. Moreover, the actions of any flow diagram need not be implemented in the order shown; nor do all of the acts need to be necessarily performed. Also, those acts that are not dependent on other acts may be performed in parallel with the other acts. The scope of embodiments is by no means limited by these specific examples.

Claims

WE CLAIM:

1. A system for artificial intelligence-based proof reader for documents comprising: a machine learning module, comprising a machine learning classifier configured to: receive a digital document; identify at least one of one or more positive sentences and one or more negative sentences present in the digital document; a shallow parser module configured to: receive the one or more negative sentences from the machine learning module; apply a set of predetermined rules to the one or more negative sentences to extract one or more positive texts in the one or more negative sentences; filter the one or more positive texts corresponding to a set of predefined patterns; and highlight the filtered one or more positive texts.

2. The system as claimed in claim 1, wherein the one or more positive sentences comprises a promissory text, a political text, an inflated text, a fact without source, a conflicting text and a speculative text.

3. The system as claimed in claim 1, wherein the one or more negative sentences comprise of non-conflicting text and non-speculative text.

4. The system as claimed in claim 1, wherein the one or more positive texts represents an incorrect text and the one or more negative texts represents a non-conflicting text and non-speculative text.

5. The system as claimed in claim 1, wherein the machine learning classifier is configured to pre-set a prediction threshold to identify the one or more positive sentences in the digital document.

6. The system as claimed in claim 1, wherein the machine learning classifier compares the digital document with one or more historical reports to identify at least one of the one or more positive sentences and the one or more negative sentences.

7. The system as claimed in claim 1, wherein the set of predetermined rules comprises of one or more chunking rules.

8. The system as claimed in claim 1, wherein the predefined patterns comprises of at least one of one or more permitted patterns and one or more non-permitted patterns.

9. The system as claimed in claim 1, wherein the machine learning module is configured to highlight the one or more positive sentences.

10. A method for artificial intelligence-based proof reader for documents comprising: receiving, by a machine learning module, a digital document; identifying, by the machine learning module, at least one of one or more positive sentences and one or more negative sentences present in the digital document; receiving, by a shallow parser, the one or more negative sentences from the machine learning module; applying, by the shallow parser, a set of predetermined rules to the one or more negative sentences to extract the one or more negative texts in the one or more negative sentences; filtering, by the shallow parser, the validated one or more negative texts corresponding to a set of predefined patterns; and highlighting, by the shallow parser, the filtered one or more negative texts.