WO2019024050A1 - Deep context-based grammatical error correction using artificial neural networks - Google Patents
- Publication number
- WO2019024050A1 (PCT/CN2017/095841)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- word
- target word
- target
- sentence
- grammatical error
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/237—Lexical tools
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/211—Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
Definitions
- the disclosure relates generally to artificial intelligence, and more particularly, to automated grammatical error correction (GEC) using artificial neural networks.
- a method for grammatical error detection is disclosed.
- a sentence is received.
- One or more target words in the sentence are identified based, at least in part, on one or more grammatical error types.
- Each of the one or more target words corresponds to at least one of the one or more grammatical error types.
- for at least one of the one or more target words, a classification of the target word with respect to the corresponding grammatical error type is estimated using an artificial neural network model trained for the grammatical error type.
- the model includes two recurrent neural networks configured to output a context vector of the target word based, at least in part, on at least one word before the target word and at least one word after the target word in the sentence.
- the model further includes a feedforward neural network configured to output a classification value of the target word with respect to the grammatical error type based, at least in part, on the context vector of the target word.
- a grammatical error in the sentence is detected based, at least in part, on the target word and the estimated classification of the target word.
- a method for training an artificial neural network model is provided.
- An artificial neural network model for estimating a classification of a target word in a sentence with respect to a grammatical error type is provided.
- the model includes two recurrent neural networks configured to output a context vector of the target word based, at least in part, on at least one word before the target word and at least one word after the target word in the sentence.
- the model further includes a feedforward neural network configured to output a classification value of the target word based, at least in part, on the context vector of the target word.
- a set of training samples are obtained.
- Each training sample in the set of training samples includes a sentence including a target word with respect to the grammatical error type and an actual classification of the target word with respect to the grammatical error type.
- a first set of parameters associated with the recurrent neural networks and a second set of parameters associated with the feedforward neural network are jointly trained based, at least in part, on differences between the actual classifications and estimated classifications of the target words in each training sample.
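To make this joint training step concrete, the following Python (PyTorch) sketch adjusts the parameters of both the recurrent networks and the feedforward network from the classification differences described above. It assumes a model of the shape sketched later in the detailed description (two recurrent networks whose concatenated output feeds a softmax classifier); the Adam optimizer, learning rate, and negative log-likelihood loss are illustrative choices not specified by the disclosure.

```python
import torch

def train_jointly(model, samples, epochs=3, lr=1e-3):
    """Jointly adjust the recurrent-network and feedforward-network parameters from
    (forward_context, backward_context, actual_label) training samples."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)  # one optimizer covers both sub-models
    for _ in range(epochs):
        for fwd_ctx, bwd_ctx, label in samples:
            probs = model(fwd_ctx, bwd_ctx)          # estimated classification distribution, (1, C)
            target = torch.tensor([label])           # actual classification (original label)
            # Loss is driven by the difference between actual and estimated classifications.
            loss = torch.nn.functional.nll_loss(torch.log(probs + 1e-9), target)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
```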
- a system for grammatical error detection includes a memory and at least one processor coupled to the memory.
- the at least one processor is configured to receive a sentence and identify one or more target words in the sentence based, at least in part, on one or more grammatical error types. Each of the one or more target words corresponds to at least one of the one or more grammatical error types.
- the at least one processor is further configured to, for at least one of the one or more target words, estimate a classification of the target word with respect to the corresponding grammatical error type using an artificial neural network model trained for the grammatical error type.
- the model includes two recurrent neural networks configured to generate a context vector of the target word based, at least in part, on at least one word before the target word and at least one word after the target word in the sentence.
- the model further includes a feedforward neural network configured to output a classification value of the target word with respect to the grammatical error type based, at least in part, on the context vector of the target word.
- the at least one processor is further configured to detect a grammatical error in the sentence based, at least in part, on the target word and the estimated classification of the target word.
- a system for grammatical error detection includes a memory and at least one processor coupled to the memory.
- the at least one processor is configured to provide an artificial neural network model for estimating a classification of a target word in a sentence with respect to a grammatical error type.
- the model includes two recurrent neural networks configured to output a context vector of the target word based, at least in part, on at least one word before the target word and at least one word after the target word in the sentence.
- the model further includes a feedforward neural network configured to output a classification value of the target word based, at least in part, on the context vector of the target word.
- the at least one processor is further configured to obtain a set of training samples.
- Each training sample in the set of training samples includes a sentence including a target word with respect to the grammatical error type and an actual classification of the target word with respect to the grammatical error type.
- the at least one processor is further configured to jointly adjust a first set of parameters associated with the recurrent neural networks and a second set of parameters associated with the feedforward neural network based, at least in part, on differences between the actual classifications and estimated classifications of the target words in each training sample.
- a software product, in accord with this concept, includes at least one computer-readable, non-transitory device and information carried by the device.
- the information carried by the device may be executable instructions regarding parameters in association with a request or operational parameters.
- a tangible computer-readable and non-transitory device having instructions recorded thereon for grammatical error detection, wherein the instructions, when executed by the computer, cause the computer to perform a series of operations.
- A sentence is received.
- One or more target words in the sentence are identified based, at least in part, on one or more grammatical error types.
- Each of the one or more target words corresponds to at least one of the one or more grammatical error types.
- for at least one of the one or more target words, a classification of the target word with respect to the corresponding grammatical error type is estimated using an artificial neural network model trained for the grammatical error type.
- the model includes two recurrent neural networks configured to output a context vector of the target word based, at least in part, on at least one word before the target word and at least one word after the target word in the sentence.
- the model further includes a feedforward neural network configured to output a classification value of the target word with respect to the grammatical error type based, at least in part, on the context vector of the target word.
- a grammatical error in the sentence is detected based, at least in part, on the target word and the estimated classification of the target word.
- a tangible computer-readable and non-transitory device having instructions recorded thereon for training an artificial neural network model, wherein the instructions, when executed by the computer, cause the computer to perform a series of operations.
- An artificial neural network model for estimating a classification of a target word in a sentence with respect to a grammatical error type is provided.
- the model includes two recurrent neural networks configured to output a context vector of the target word based, at least in part, on at least one word before the target word and at least one word after the target word in the sentence.
- the model further includes a feedforward neural network configured to output a classification value of the target word based, at least in part, on the context vector of the target word.
- a set of training samples are obtained.
- Each training sample in the set of training samples includes a sentence including a target word with respect to the grammatical error type and an actual classification of the target word with respect to the grammatical error type.
- a first set of parameters associated with the recurrent neural networks and a second set of parameters associated with the feedforward neural network are jointly trained based, at least in part, on differences between the actual classifications and estimated classifications of the target words in each training sample.
- FIG. 1 is a block diagram illustrating a grammatical error correction (GEC) system in accordance with an embodiment
- FIG. 2 is a depiction of an example of automated grammatical error correction performed by the system in FIG. 1;
- FIG. 3 is a flow chart illustrating an example of a method for grammatical error correction in accordance with an embodiment
- FIG. 4 is a block diagram illustrating an example of a classification-based GEC module of the system in FIG. 1 in accordance with an embodiment
- FIG. 5 is a depiction of an example of providing a classification of a target word in a sentence using the system in FIG. 1 in accordance with an embodiment
- FIG. 6 is a schematic diagram illustrating an example of an artificial neural network (ANN) model for grammatical error correction in accordance with an embodiment
- FIG. 7 is a schematic diagram illustrating another example of an ANN model for grammatical error correction in accordance with an embodiment
- FIG. 8 is a detailed schematic diagram illustrating an example of the ANN model in FIG. 6 in accordance with an embodiment
- FIG. 9 is a flow chart illustrating an example of a method for grammatical error correction of a sentence in accordance with an embodiment
- FIG. 10 is a flow chart illustrating an example of a method for classifying a target word with respect to a grammatical error type in accordance with an embodiment
- FIG. 11 is a flow chart illustrating another example of a method for classifying a target word with respect to a grammatical error type in accordance with an embodiment
- FIG. 12 is a flow chart illustrating an example of a method for providing a grammar score in accordance with an embodiment
- FIG. 13 is a block diagram illustrating an ANN model training system in accordance with an embodiment
- FIG. 14 is a depiction of an example of a training sample used by the system in FIG. 13;
- FIG. 15 is a flow chart illustrating an example of a method for ANN model training for grammatical error correction in accordance with an embodiment
- FIG. 16 is a schematic diagram illustrating an example of training an ANN model for grammatical error correction in accordance with an embodiment
- FIG. 17 is a block diagram illustrating an example of a computer system useful for implementing various embodiments set forth in the disclosure.
- terms, such as “a,” “an,” or “the,” again, may be understood to convey a singular usage or to convey a plural usage, depending at least in part upon context.
- the term “based on” may be understood as not necessarily intended to convey an exclusive set of factors and may, instead, allow for existence of additional factors not necessarily expressly described, again, depending at least in part on context.
- the automated GEC systems and methods disclosed herein provide the ability to efficiently and effectively detect and correct grammatical errors using a deep context model that can be trained from native text data.
- the error correction task can be treated as a classification problem where the grammatical context representation can be learnt from native text data that is widely available.
- the systems and methods disclosed herein do not require sophisticated feature engineering, which usually requires linguistic knowledge and may not cover all context patterns.
- the systems and methods disclosed herein can use deep features directly, such as recurrent neural networks to represent context.
- the systems and methods disclosed herein can leverage the abundant native plain text corpora and learn context representation and classification jointly in an end-to-end fashion to correct grammatical errors effectively.
- FIG. 1 is a block diagram illustrating a GEC system 100 in accordance with an embodiment.
- GEC system 100 includes an input pre-processing module 102, a parsing module 104, a target word dispatching module 106, and a plurality of classification-based GEC modules 108, each of which is configured to perform classification-based grammatical error detection and correction using deep context.
- GEC system 100 may be implemented using a pipeline architecture to combine other GEC methods, such as machine translation and predefined rule-based methods, with the classification-based method to further improve the performance of GEC system 100.
- GEC system 100 may further include a machine translation-based GEC module 110, a rule-based GEC module 112, and a scoring/correction module 114.
- Input pre-processing module 102 is configured to receive an input text 116 and pre-process input text 116.
- Input text 116 may include at least one English sentence, for example, a single sentence, a paragraph, an article, or any text corpus.
- Input text 116 may be received directly, for example, via hand writing, typing, or copying/pasting.
- Input text 116 may be received indirectly as well, for example, via speech recognition or image recognition.
- any suitable speech recognition techniques may be used to convert voice input into input text 116.
- any suitable optical character recognition (OCR) techniques may be used to transfer text contained in images into input text 116.
- Input pre-processing module 102 may pre-process input text 116 in various manners. In some embodiments, as grammatical errors are usually analyzed in the context of a particular sentence, input pre-processing module 102 may divide input text 116 into sentences so that each sentence can be treated as a unit for the later process. Partitioning input text 116 into sentences may be performed by recognizing the beginning and/or end of a sentence. For example, input pre-processing module 102 may search for certain punctuation marks, such as a period, semicolon, question mark, or exclamation mark, as indicators of the end of a sentence. Input pre-processing module 102 may also search for a word with the first letter capitalized as an indicator of the start of a sentence.
- input pre-processing module 102 may lowercase input text 116 for ease of the later process, for example, by converting any uppercase letters in input text 116 to lowercase letters.
- input pre-processing module 102 may also check the tokens (words, phrases, or any text strings) in input text 116 against a vocabulary database 118 to determine any tokens that are not in vocabulary database 118.
- the unmatched tokens may be treated as special tokens, e.g., single unk tokens (unknown tokens).
- Vocabulary database 118 includes all the words that can be processed by GEC system 100. Any words or other tokens that are not in vocabulary database 118 may be ignored or treated differently by GEC system 100.
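As a rough illustration of the pre-processing described above, the sketch below splits an input text into sentences at sentence-final punctuation, lowercases it, and maps out-of-vocabulary tokens to a special unk token. The regular expressions, the `<unk>` name, and the toy vocabulary are simplifications, not the disclosure's actual implementation.

```python
import re

UNK = "<unk>"  # special token for out-of-vocabulary words (hypothetical name)

def preprocess(input_text, vocabulary):
    """Split text into sentences, lowercase it, and map OOV tokens to UNK."""
    # Period, semicolon, question mark, and exclamation mark end a sentence (see above).
    sentences = re.split(r"(?<=[.;?!])\s+", input_text.strip())
    processed = []
    for sentence in sentences:
        tokens = re.findall(r"[\w']+|[.;?!,]", sentence.lower())  # crude tokenizer
        processed.append([t if t in vocabulary else UNK for t in tokens])
    return processed

vocab = {"i", "go", "to", "school", "everyday", "she", "it", ".", "!"}
print(preprocess("I go to school everyday. She like it!", vocab))
# "like" is not in the toy vocabulary, so it becomes "<unk>"
```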
- Parsing module 104 is configured to parse input text 116 to identify one or more target words in each sentence of input text 116. Unlike known systems that treat all grammatical errors uniformly and attempt to translate incorrect text into correct text, GEC system 100 uses models trained for each specific grammatical error type, as described below in detail. Thus, in some embodiments, parsing module 104 may identify the target words from the text tokens in each sentence based on predefined grammatical error types so that each target word corresponds to at least one of the grammatical error types.
- the grammatical error types include, but are not limited to, the article error, subject-verb agreement error, verb form error, preposition error, and noun number error.
- parsing module 104 may tokenize each sentence and identify the target words from the tokens in conjunction with vocabulary database 118, which includes vocabulary information and knowledge known to GEC system 100.
- for the subject-verb agreement error, parsing module 104 may extract the mapping relationships between non-third-person singular present words and third-person singular present words in advance. Parsing module 104 then may locate the verbs as the target words. For the article error, parsing module 104 may locate the nouns and noun phrases (combinations of noun words and adjective words) as the target words. For the verb form error, parsing module 104 may locate the verbs as the target words, which are in the base form, gerund or present participle, or past participle. Regarding the preposition error, parsing module 104 may locate the prepositions as the target words. As to the noun number error, parsing module 104 may locate the nouns as the target words.
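A possible implementation of this target-word identification, using PoS tags as described; spaCy and its tag set are illustrative choices here (the disclosure does not prescribe a tagger, and later mentions Stanford CoreNLP as one option):

```python
import spacy

nlp = spacy.load("en_core_web_sm")  # any PoS tagger would do; spaCy is illustrative

# Map each grammatical error type to a predicate over PoS tags.
TARGET_PREDICATES = {
    "subject_verb_agreement": lambda tok: tok.tag_ in {"VBZ", "VBP", "VB"},
    "verb_form":              lambda tok: tok.tag_ in {"VB", "VBG", "VBN"},
    "preposition":            lambda tok: tok.pos_ == "ADP",
    "noun_number":            lambda tok: tok.tag_ in {"NN", "NNS"},
    "article":                lambda tok: tok.pos_ == "NOUN",
}

def identify_target_words(sentence):
    """Return (token index, word, error type) triples for each target word."""
    doc = nlp(sentence)
    targets = []
    for tok in doc:
        for error_type, predicate in TARGET_PREDICATES.items():
            if predicate(tok):  # one word may match several error types
                targets.append((tok.i, tok.text, error_type))
    return targets

print(identify_target_words("She go to school everyday"))
```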
- one word may be identified by parsing module 104 as corresponding to multiple grammatical error types.
- a verb may be identified as the target word with respect to the subject-verb agreement error and the verb form error
- a noun or noun phrase may be identified as the target word with respect to the article error and the noun number error.
- a target word may include a phrase that is a combination of multiple words, such as a noun phrase.
- parsing module 104 may be configured to determine the actual classification of each target word. Parsing module 104 may assign an original label to each target word with respect to the corresponding grammatical error type as the actual classification value of the target word. For example, for the subject-verb agreement error, the actual classification of a verb is either the third person singular present form or the base form. Parsing module 104 may assign the target word the original label, for example, “1” if the target word is in the third person singular present form or “0” if the target word is in the base form. For the article error, the actual classifications of the target words may be “a/an,” “the,” or “no article.”
- Parsing module 104 may check the article in front of the target word (a noun word or noun phrase) to determine the actual classification of each target word.
- the actual classifications of the target words (e.g., verbs) with respect to the verb form error may be the base form, gerund or present participle, or past participle.
- for the preposition error, the most frequently used prepositions may be used by parsing module 104 as the actual classifications.
- the actual classifications include 12 original labels: “about,” “at,” “by,” “for,” “from,” “in,” “of,” “on,” “to,” “until,” “with,” and “against.”
- the actual classifications of the target words with respect to the noun number error may be the singular form or plural form.
- parsing module 104 may determine the original label of each target word with respect to the corresponding grammatical error type based on the part of speech (PoS) tags in conjunction with vocabulary database 118.
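A sketch of such original-label assignment from Penn Treebank-style PoS tags (e.g., spaCy's `token.tag_`), with the label sets following the prose above; the function name and signature are hypothetical:

```python
def original_label(word, tag, prev_word, error_type):
    """Derive the actual classification (original label) of a target word."""
    if error_type == "subject_verb_agreement":
        return 1 if tag == "VBZ" else 0                      # third person singular vs. base form
    if error_type == "verb_form":
        return {"VB": "base", "VBG": "gerund/present participle",
                "VBN": "past participle"}.get(tag)
    if error_type == "preposition":
        return word.lower()                                  # the label is the preposition itself
    if error_type == "noun_number":
        return "plural" if tag == "NNS" else "singular"
    if error_type == "article":
        if prev_word in ("a", "an"):                          # check the article before the noun
            return "a/an"
        return "the" if prev_word == "the" else "no article"
    raise ValueError(f"unsupported error type: {error_type}")

print(original_label("adding", "VBG", None, "verb_form"))    # -> gerund/present participle
```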
- Target word dispatching module 106 is configured to dispatch each target word to classification-based GEC module 108 for the corresponding grammatical error type.
- an ANN model 120 is independently trained and used by corresponding classification-based GEC module 108.
- each classification-based GEC module 108 is associated with one specific grammatical error type and is configured to handle the target words with respect to the same grammatical error type. For example, for a target word that is a preposition (with respect to the preposition error type), target word dispatching module 106 may send the preposition to the classification-based GEC module 108 that handles preposition errors.
- target word dispatching module 106 may send the same target word to multiple classification-based GEC modules 108. It is also to be appreciated that in some embodiments, the resources assigned by GEC system 100 to each classification-based GEC module 108 may not be equal. For example, depending on the frequency with which each grammatical error type occurs within a certain user cohort or for a particular user, target word dispatching module 106 may dispatch the target words with respect to the most frequently occurring grammatical error type with the highest priority.
- target word dispatching module 106 may schedule the processing of each target word in each sentence in an optimal manner in view of the workload of each classification-based GEC module 108 to reduce latency.
- Each classification-based GEC module 108 includes corresponding ANN model 120 that has been trained for the corresponding grammatical error type.
- Classification-based GEC module 108 is configured to estimate a classification of the target word with respect to the corresponding grammatical error type using corresponding ANN model 120.
- ANN model 120 includes two recurrent neural networks configured to output a context vector of the target word based on at least one word before the target word and at least one word after the target word in the sentence.
- ANN model 120 further includes a feedforward neural network configured to output a classification value of the target word with respect to the grammatical error type based on the context vector of the target word.
- Classification-based GEC module 108 is further configured to detect a grammatical error in the sentence based on the target word and the estimated classification of the target word. As described above, in some embodiments, the actual classification of each target word may be determined by parsing module 104. Classification-based GEC module 108 then may compare the estimated classification of the target word with the actual classification of the target word, and detect the grammatical error in the sentence when the actual classification does not match the estimated classification of the target word. For example, for a certain grammatical error type, corresponding ANN model 120 may learn an embedding function of the variable-length context surrounding the target word, and corresponding classification-based GEC module 108 may predict the classification of the target word with the context embedding. If the predicted classification label is different from the original label of the target word, the target word may be flagged as an error, and the prediction may be used as the correction.
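A minimal sketch of this detect-and-compare step, assuming a trained model like the DeepContextGEC sketch given later (alongside FIG. 6) that returns a probability distribution over labels; the probability-threshold check follows the description of method 300 below:

```python
import torch

def detect_error(model, fwd_ctx, bwd_ctx, actual_label, threshold=0.5):
    """Flag a grammatical error when the estimated classification differs from the
    actual classification with probability above a predefined threshold."""
    with torch.no_grad():
        probs = model(fwd_ctx, bwd_ctx)[0]       # distribution over labels, shape (num_labels,)
    predicted = int(probs.argmax())
    confidence = float(probs[predicted])
    if predicted != actual_label and confidence > threshold:
        # The prediction doubles as the suggested correction.
        return {"error": True, "correction_label": predicted, "confidence": confidence}
    return {"error": False}
```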
- multiple classification-based GEC modules 108 may be applied in parallel in GEC system 100 to concurrently detect grammatical errors for various grammatical error types.
- the resources of GEC system 100 may be assigned to different grammatical error types based on the occurrence frequencies of each grammatical error type. For example, more computational resources may be allocated by GEC system 100 to handle grammatical error types that occur more frequently than others. The allocation of resources may be dynamically adjusted in view of the frequency change and/or the workload of each classification-based GEC module 108.
- Machine translation-based GEC module 110 is configured to detect one or more grammatical errors in each sentence based on statistical machine translation, such as phrase-based machine translation, neural network-based machine translation, etc.
- machine translation-based GEC module 110 includes a model having a language sub-model assigning a probability for a sentence and a translation sub-model assigning a conditional probability.
- the language sub-model may be trained using a monolingual training data set in the target language.
- the parameters of the translation sub-model may be estimated from a parallel training data set, i.e., the set of foreign sentences and their corresponding translations into the target language.
- machine translation-based GEC module 110 may be applied to the output of classification-based GEC modules 108, or classification-based GEC modules 108 may be applied to the output of machine translation-based GEC module 110. Also, in some embodiments, by adding machine translation-based GEC module 110 into the pipeline, certain classification-based GEC modules 108 that may be outperformed by machine translation-based GEC module 110 may not be included in the pipeline.
- Rule-based GEC module 112 is configured to detect one or more grammatical errors in each sentence based on predefined rules. It is to be appreciated that the position of rule-based GEC module 112 in the pipeline is not limited to the end as shown in FIG. 1, but can be at the beginning of the pipeline as the first detection module or between classification-based GEC modules 108 and machine translation-based GEC module 110. In some embodiments, other mechanical errors, such as punctuation, spelling, and capitalization errors, can be detected and fixed using predefined rules by rule-based GEC module 112 as well.
- Scoring/correction module 114 is configured to provide a corrected text and/or grammar score 122 of input text 116 based on the grammatical error results received from the pipeline. Taking classification-based GEC modules 108 as an example, for each target word that is detected as having a grammatical error because the estimated classification does not match the actual classification, the grammatical error correction of the target word may be provided by scoring/correction module 114 based on the estimated classification of the target word. To evaluate input text 116, scoring/correction module 114 may also provide grammar score 122 based on the grammatical error results received from the pipeline using a scoring function.
- the scoring function may assign weights to each grammatical error type so that grammatical errors of different types may have different levels of impact on grammar score 122. Weights may be assigned to precision and recall as the weighted factors in evaluating the grammatical error results.
- the user from whom input text 116 is received may be considered by the scoring function as well. For example, the weights may be different for different users, or the information of the user (e.g., native language, residency, education level, historical scores, age, etc.) may be factored into the scoring function.
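The disclosure does not give the scoring function itself; the following is one plausible weighted formulation, with all weights and the user factor purely hypothetical:

```python
ERROR_TYPE_WEIGHTS = {           # hypothetical per-type weights
    "subject_verb_agreement": 1.0,
    "verb_form": 1.0,
    "article": 0.6,
    "preposition": 0.8,
    "noun_number": 0.7,
}

def grammar_score(errors, num_target_words, user_factor=1.0, max_score=100.0):
    """Penalize each detected error by its type weight, scaled by a per-user factor."""
    if num_target_words == 0:
        return max_score
    penalty = sum(ERROR_TYPE_WEIGHTS.get(e["type"], 1.0) for e in errors)
    return max(0.0, max_score - user_factor * max_score * penalty / num_target_words)

print(grammar_score([{"type": "verb_form"}, {"type": "preposition"}], num_target_words=20))
```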
- FIG. 2 is a depiction of an example of automated grammatical error correction performed by GEC system 100 in FIG. 1.
- an input text 202 includes a plurality of sentences and is received from a user identified by a user ID -1234.
- a corrected text 204 with a grammar score is provided for the user.
- the verb “adding” is identified as a target word with respect to the verb form error by GEC system 100.
- the actual classification of the target word “adding” is a gerund or present participle.
- GEC system 100 applies ANN model 120 trained for the verb form error and estimates that the classification of the target word “adding” is the base form - “add.” As the estimated classification does not match the actual classification of the target word “adding,” a verb form grammatical error is detected by GEC system 100, which affects the grammar score in view of the weight applied to the verb form error type and/or the personal information of the user.
- the estimated classification of the target word “adding” is also used by GEC system 100 to provide the correction “add” to replace “adding” in corrected text 204.
- the same ANN model 120 for the verb form error is used by GEC system 100 to detect and correct other verb form errors in input text 202, such as “disheart” to “dishearting.”
- ANN models 120 for other grammatical error types are used by GEC system 100 to detect other types of grammatical errors.
- ANN model 120 for the preposition error is used by GEC system 100 to detect and correct preposition errors in input text 202, such as “for” to “in,” and “to” to “on.”
- FIG. 3 is a flow chart illustrating an example of a method 300 for grammatical error correction in accordance with an embodiment.
- Method 300 can be performed by processing logic that can comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc. ) , software (e.g., instructions executing on a processing device) , or a combination thereof. It is to be appreciated that not all steps may be needed to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 3, as will be understood by a person of ordinary skill in the art.
- Method 300 shall be described with reference to FIG. 1. However, method 300 is not limited to that example embodiment.
- an input text is received.
- the input text includes at least one sentence.
- the input text may be received directly from, for example, writing, typing, or copying/pasting, or indirectly from, for example, speech recognition or image recognition.
- the received input text is pre-processed, for example, by being divided into sentences and tokenized.
- the pre-processing may include converting uppercase letters into lowercase letters so that the input text is lowercased.
- the pre-processing may include identifying any tokens in the input text that are not in vocabulary database 118 and representing them as special tokens.
- 302 and 304 may be performed by input pre-processing module 102 of GEC system 100.
- the pre-processed input text is parsed to identify one or more target words in each sentence.
- the target words may be identified from the text tokensbased on the grammatical error types so that each target word corresponds to at least one of the grammatical error types.
- the grammatical error types include, but are not limited to, the article error, subject-verb agreement error, verb form error, preposition error, and noun number error.
- the actual classification of each target word with respect to the corresponding grammatical error type is determined. The determination may be automatically made, for example, based on PoS tags and text tokens in the sentence.
- the target word identification and actual classification determination may be performed by NLP tools such as the Stanford CoreNLP tools.
- 306 may be performed by parsing module 104 of GEC system 100.
- each target word is dispatched to corresponding classification-based GEC module 108.
- Each classification-based GEC module 108 includes ANN model 120 trained for a corresponding grammatical error type, for example, over native training samples.
- 308 may be performed by target word dispatching module 106 of GEC system 100.
- one or more grammatical errors in each sentence are detected using ANN models 120.
- a classification of the target word with respect to the corresponding grammatical error type may be estimated using corresponding ANN model 120.
- A grammatical error then may be detected based on the target word and the estimated classification of the target word. For example, if the estimation is different from the original label and the probability is larger than a predefined threshold, then the grammatical error is deemed to be found.
- 310 may be performed by classification-based GEC modules 108 of GEC system 100.
- one or more grammatical errors in each sentence may be detected using machine translation.
- 312 may be performed by machine translation-based GEC module 110 of GEC system 100.
- one or more grammatical errors in each sentence may be detected based on predefined rules.
- 314 may be performed by rule-based GEC module 112 of GEC system 100.
- a pipeline architecture may be used to combine any suitable machine translation and/or predefined rule-based methods with the classification-based methods described herein to further improve the performance of GEC system 100.
- corrections to the detected grammatical errors and/or a grammar score of the input text are provided.
- a weight may be applied to each grammatical error result of target words based on the corresponding grammatical error type.
- the grammar score of each sentence can be determined based on the grammatical error results and the target words in the sentence as well as the weights applied to each grammatical error result.
- the grammar score may be provided based on the information associated with the user from whom the sentence is received as well.
- as to the corrections to the detected grammatical errors, in some embodiments, the estimated classification of the target word with respect to the corresponding grammatical error type may be used to generate the correction. It is to be appreciated that the corrections and grammar score are not necessarily provided together. 316 may be performed by scoring/correction module 114 of GEC system 100.
- FIG. 4 is a block diagram illustrating an example of classification-based GEC module 108 of GEC system 100 in FIG. 1 in accordance with an embodiment.
- classification-based GEC module 108 is configured to receive a target word in a sentence 402 and estimate the classification of the target word using ANN model 120 for the corresponding grammatical error type of the target word.
- the target word in sentence 402 is also received by a target word labeling unit 404 (e.g., in parsing module 104) .
- Target word labeling unit 404 is configured to determine the actual classification (e.g., original label) of the target word based on, for example, PoS tags and text tokens of sentence 402.
- Classification-based GEC module 108 is further configured to provide the grammatical error result based on the estimated classification and actual classification of the target word. As shown in FIG. 4, classification-based GEC module 108 includes an initial context generation unit 406, a deep context representation unit 408, a classification unit 410, an attention unit 412, and a classification comparison unit 414.
- Initial context generation unit 406 is configured to generate a plurality of sets of initial context vectors (initial context matrices) of the target word based on the words surrounding the target word (context words) in sentence 402.
- the initial context vector sets include a set of forward initial context vectors (forward initial context matrix) generated based on at least one word before the target word (forward context words) in sentence 402 and a set of backward initial context vectors (backward initial context matrix) generated based on at least one word after the target word (backward context words) in sentence 402.
- Each initial context vector represents one context word in sentence 402.
- an initial context vector may be a one-hot vector that represents a word based on one-hot encoding so that the size (dimension) of the one-hot vector is the same as the vocabulary size (e.g., in vocabulary database 118) .
- an initial context vector may be a low-dimensional vector with the dimension smaller than the vocabulary size, such as a word embedding vector of a context word.
- a word embedding vector may be generated by any suitable generic word embedding approach, such as but not limited to word2vec or GloVe.
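For instance, a word2vec model could be trained on a native plain-text corpus with gensim (an illustrative tool choice, not one named by the disclosure):

```python
from gensim.models import Word2Vec

# Train 300-dimensional embeddings on tokenized native text (toy corpus here).
corpus = [["i", "go", "to", "school", "everyday"],
          ["she", "goes", "to", "work", "everyday"]]
w2v = Word2Vec(sentences=corpus, vector_size=300, window=5, min_count=1)
print(w2v.wv["school"].shape)  # (300,)
```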
- initial context generation unit 406 may use one or more recurrent neural networks configured to output one or more sets of initial context vectors.
- the recurrent neural network (s) used by initial context generation unit 406 may be part of ANN model 120.
- the number of context words used for generating the set of forward or backward initial context vectors is not limited.
- the set of forward initial context vectors are generated based on all the words before the target word in sentence 402
- the set of backward initial context vectors are generated based on all the words after the target word in sentence 402.
- each classification-based GEC module 108 and corresponding ANN model 120 handle a specific grammatical error type, and correction of different types of grammatical errors may need dependencies from different word distances (e.g., a preposition is determined by the words near the target word, while the status of a verb can be affected by the subject far away from the verb)
- the number of context words used to generate the set of forward or backward initial context vectors (i.e., the window size) may be determined based on the grammatical error type associated with classification-based GEC module 108 and corresponding ANN model 120.
- an initial context vector may be generated based on the lemma of the target word itself.
- A lemma is the base form of a word (e.g., the words “walk,” “walks,” “walked,” and “walking” all have the same lemma “walk”).
- the lemma form of the target noun word may be introduced in the form of an initial lemma context vector as extra context information because whether the target word should be in the singular or plural form is closely related to the word itself.
- the initial context vector of the lemma of the target word may be part of the set of forward initial context vectors or part of the set of backward initial context vectors.
- semantic features otherwise need to be manually designed and extracted from the sentence to generate feature vectors, which can hardly cover all situations due to the complexity of language.
- complex feature engineering is not needed by classification-based GEC module 108 disclosed herein as the context words of the target word in sentence 402 are used directly as the initial context information (e.g., in the form of initial context vectors) , and the deep context feature representation can be learnt jointly with classification in an end-to-end fashion as described below in detail.
- a sentence consists of n words 1-n, including the target word i.
- a corresponding initial context vector 1, 2, ..., or i-1 is generated for each word before the target word i.
- the initial context vectors 1, 2, ..., and i-1 are “forward” vectors as they are generated from the words before the target word i and are to be fed into the later stage in a forward direction (i.e., starting from the first word 1 at the beginning of the sentence).
- a corresponding initial context vector i+1, i+2, ..., or n is generated for each word after the target word i.
- the initial context vectors n, ..., i+2, and i+1 are “backward” vectors as they are generated from the words after the target word i and are to be fed into the later stage in a backward direction (i.e., starting from the last word n at the end of the sentence).
- the set of forward initial context vectors may be represented as a forward initial context matrix having the number of columns the same as the dimension of the word embedding and the number of rows the same as the number of words before the target word i.
- the first row in the forward initial context matrix may be the word embedding vector of the first word 1
- the last row in the forward initial context matrix may be the word embedding vector of the word i-1 immediately before the target word i.
- the set of backward initial context vectors may be represented as a backward initial context matrix having the number of columns the same as the dimension of the word embedding and the number of rows the same as the number of words after the target word i.
- the first row in the backward initial context matrix may be the word embedding vector of the last word n, and the last row in the backward initial context matrix may be the word embedding vector of the word i+1 immediately after the target word i.
- the dimension of each word embedding vector may be at least 100, for example, 300.
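A sketch of building these forward and backward initial context matrices from pretrained word embeddings; the dictionary-based embedding lookup and zero-vector OOV handling are assumptions:

```python
import numpy as np

def initial_context_matrices(tokens, target_index, embeddings, window=None, dim=300):
    """Build the forward and backward initial context matrices for a target word.

    embeddings: dict mapping word -> np.ndarray of shape (dim,), e.g., GloVe vectors.
    window:     optional window size; None means use the whole sentence.
    """
    def embed(word):
        return embeddings.get(word, np.zeros(dim))   # OOV words get a zero vector here

    before = tokens[:target_index]
    after = tokens[target_index + 1:]
    if window is not None:
        before, after = before[-window:], after[:window]

    # Forward matrix: first context word down to the word just before the target.
    forward = np.stack([embed(w) for w in before]) if before else np.zeros((0, dim))
    # Backward matrix: last word of the sentence down to the word just after the target.
    backward = np.stack([embed(w) for w in reversed(after)]) if after else np.zeros((0, dim))
    return forward, backward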
- a lemma initial context vector lem (e.g., a word embedding vector of the lemma of the target word i) may be generated as well.
- deep context representation unit 408 is configured to provide, using ANN model 120, a context vector of the target word based on the context words in sentence 402, for example, the sets of forward and backward initial context vectors generated by initial context generation unit 406.
- Classification unit 410 is configured to provide, using ANN model 120, a classification value of the target word with respect to the grammatical error type based on the deep context representation of the target word in sentence 402, for example, the context vector generated by deep context representation unit 408.
- ANN model 120 includes a deep context representation sub-model 602 that can be used by deep context representation unit 408 and a classification sub-model 604 that can be used by classification unit 410.
- Deep context representation sub-model 602 and classification sub-model 604 may be jointly trained in an end-to-end fashion.
- Deep context representation sub-model 602 includes two recurrent neural networks: a forward recurrent neural network 606 and a backward recurrent neural network 608.
- Each recurrent neural network 606 or 608 may be a long short-term memory (LSTM) neural network, a gated recurrent unit (GRU) neural network, or any other suitable recurrent neural networks where connections between the hidden units form a directed cycle.
- Recurrent neural networks 606 and 608 are configured to output a context vector of the target word based on the initial context vectors generated from the context words of the target word in sentence 402.
- forward recurrent neural network 606 is configured to receive the set of forward initial context vectors and provide a forward context vector of the target word based on the set of forward initial context vectors.
- Forward recurrent neural network 606 may be fed with the set of forward initial context vectors in the forward direction.
- Backward recurrent neural network 608 is configured to receive the set of backward initial context vectors and provide a backward context vector of the target word based on the set of backward initial context vectors.
- Backward recurrent neural network 608 may be fed with the set of backward initial context vectors in the backward direction.
- the sets of forward and backward initial context vectors may be word embedding vectors as described above. It is to be appreciated that, in some embodiments, the lemma initial context vector of the target word may be fed into forward recurrent neural network 606 and/or backward recurrent neural network 608 to generate the forward context vector and/or backward context vector.
- the forward recurrent neural network is fed with the set of forward initial context vectors (e.g., in the form of the forward initial context matrix) in the forward direction and generates a forward context vector, denoted for.
- the backward recurrent neural network is fed with the set of backward initial context vectors (e.g., in the form of the backward initial context matrix) in the backward direction and generates a backward context vector, denoted back.
- the lemma initial context vector lem may be fed into the forward recurrent neural network and/or the backward recurrent neural network.
- the number of hidden units in each of the forward and backward recurrent neural networks is at least 300, for example, 600.
- a deep context vector i of the target word i is then generated by concatenating the forward context vector for and the backward context vector back.
- the deep context vector i represents the deep context information of the target word i based on the context words 1 to i-1 and the context words i+1 to n surrounding the target word i (and the lemma of the target word i in some embodiments).
- the deep context vector i may be considered as the embedding of the joint sentential context around the target word i.
- the deep context vector i is a generic representation that can handle various situations, as no complex feature engineering is needed to manually design and extract semantic features for representing the context of the target word i.
- classification sub-model 604 includes a feedforward neural network 610 configured to output the classification value of the target word with respect to the grammatical error type based on the context vector of the target word.
- Feedforward neural network 610 may include a multi-layer perceptron (MLP) neural network or any other suitable feedforward neural network where connections between the hidden units do not form a cycle.
- the deep context vector i is fed into the feedforward neural network to generate the classification value y of the target word i.
- the classification value y can be defined in different ways as shown in TABLE I.
- the grammatical error type is not limited to the five examples in TABLE I, and the definition of the classification value y is also not limited by the examples shown in TABLE I. It is also to be appreciated that in some embodiments, the classification value y may be represented as a probability distribution of the target word over the classes (labels) associated with the grammatical error type.
- feedforward neural network 610 may include a first layer having a first activation function of a fully connected linear operation on the context vector.
- the first activation function in the first layer may be, for example, the rectified linear unit activation function, or any other suitable activation function that is a function of the output from the previous layer(s).
- Feedforward neural network 610 may also include a second layer connected to the first layer and having a second activation function for generating the classification value.
- the second activation function in the second layer may be, for example, the softmax activation function, or any other suitable activation functions used for multiclass classification.
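Putting deep context representation sub-model 602 and classification sub-model 604 together, here is a minimal PyTorch sketch consistent with the description (two GRUs, then a two-layer MLP with ReLU and softmax). The class name and default sizes (300-dimensional embeddings and 600 hidden units, per the examples in the text) are illustrative, not the patent's exact architecture:

```python
import torch
import torch.nn as nn

class DeepContextGEC(nn.Module):
    """Two GRUs embed the forward/backward context; an MLP classifies the target word."""

    def __init__(self, embed_dim=300, hidden=600, num_labels=2):
        super().__init__()
        self.forward_gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.backward_gru = nn.GRU(embed_dim, hidden, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden, hidden),  # first layer: fully connected linear operation
            nn.ReLU(),                      # first activation: rectified linear unit
            nn.Linear(hidden, num_labels),  # second layer
        )

    def forward(self, fwd_ctx, bwd_ctx):
        # fwd_ctx: (1, len_before, embed_dim), fed from the sentence start toward the target
        # bwd_ctx: (1, len_after, embed_dim), fed from the sentence end toward the target
        _, h_for = self.forward_gru(fwd_ctx)
        _, h_back = self.backward_gru(bwd_ctx)
        context = torch.cat([h_for[-1], h_back[-1]], dim=-1)  # deep context vector
        return torch.softmax(self.mlp(context), dim=-1)       # second activation: softmax
```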
- attention unit 412 is configured to provide, using ANN model 120, a context weight vector of the target word based on at least one word before the target word and at least one word after the target word in sentence 402.
- FIG. 7 is a schematic diagram illustrating another example of ANN model 120 for grammatical error correction in accordance with an embodiment. Compared with the example shown in FIG. 6, ANN model 120 in FIG. 7 further includes an attention mechanism sub-model 702 that can be used by attention unit 412. The weighted context vector is then computed by applying the context weight vector to the context vector. Deep context representation sub-model 602, classification sub-model 604, and attention mechanism sub-model 702 may be jointly trained in an end-to-end fashion.
- attention mechanism sub-model 702 includes a feedforward neural network 704 configured to generate the context weight vector of the target word based on the context words of the target word.
- Feedforward neural network 704 may be trained based on the distances between each context word to the target word in the sentence.
- the sets of initial context vectors can be generated based on all the surrounding words in the sentence, and the context weight vector can tune the weighted context vector to focus on those context words that affect grammatical usage.
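A sketch of such an attention mechanism over per-word context representations follows; the exact formulation (e.g., how word-to-target distances enter the scoring) is not given in this excerpt, so a common additive scoring variant is assumed:

```python
import torch
import torch.nn as nn

class ContextAttention(nn.Module):
    """Feedforward network producing a weight per context position (cf. sub-model 702).

    This additive variant scores each per-word GRU output and normalizes with softmax;
    the patent's actual attention formulation may differ.
    """

    def __init__(self, hidden=600):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(hidden, hidden), nn.Tanh(), nn.Linear(hidden, 1))

    def forward(self, gru_outputs):
        # gru_outputs: (1, seq_len, hidden) per-word context representations
        weights = torch.softmax(self.score(gru_outputs), dim=1)  # context weight vector
        return (weights * gru_outputs).sum(dim=1)                # weighted context vector
```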
- classification comparison unit 414 is configured to compare the estimated classification value provided by classification unit 410 with the actualclassification value provided by target word labeling unit 404 to detect the presence of any error of the grammatical error type. If the actual classification value is the same as the estimated classification value, then no error of the grammatical error type is detected for the target word. Otherwise, an error of the grammatical error type is detected, and the estimated classification value is used to provide the correction.
- the estimated classification value of the target word “adding” with respect to the verb form error is “0” (base form)
- the actual classification value of the target word “adding” is “1” (gerund or present participle).
- a verb form error is detected, and the correction is the base form of the target word “adding.”
- FIG. 8 is a detailed schematic diagram illustrating an example of ANN model 120 in FIG. 6 in accordance with an embodiment.
- ANN model 120 includes a forward GRU neural network, a backward GRU neural network, and an MLP neural network that are jointly trained.
- the forward context word “I” is fed to the forward GRU neural network from left to right (the forward direction)
- the backward context words “to school everyday” are fed to the backward GRU neural network from right to left (the backward direction) .
- the context vector for the target word w_i can be defined as Equation 1:
- context (w_i) = [lGRU (w_1, ..., w_(i-1)) ; rGRU (w_n, ..., w_(i+1))] (Equation 1)
- where lGRU is a GRU reading the words from left to right (the forward direction) in a given context, rGRU is a GRU reading the words in reverse, from right to left (the backward direction), and distinct left-to-right and right-to-left word embeddings of the context words are used on the two sides.
- the concatenated vector is fed to the MLP neural network to capture the inter-dependencies of the two sides.
- a softmax layer may be used to predict the classification of the target word (e.g., the target word itself or the status of the target word, e.g., singular or plural), for example:
- y = softmax (MLP (context (w_i)))
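An illustrative run tying the earlier sketches together for the FIG. 8 example sentence “I go to school everyday” with target word “go”; random tensors stand in for real word embeddings:

```python
import torch

model = DeepContextGEC(embed_dim=300, hidden=600, num_labels=2)  # from the sketch above
fwd_ctx = torch.randn(1, 1, 300)   # embedding of the forward context word "I"
bwd_ctx = torch.randn(1, 3, 300)   # "everyday", "school", "to" fed in the backward direction
probs = model(fwd_ctx, bwd_ctx)
print(probs)                        # distribution over {base form, third person singular}
```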
- FIG. 9 is a flow chart illustrating an example of a method 900 for grammatical error correction of a sentence in accordance with an embodiment.
- Method 900 can be performed by processing logic that can comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc. ) , software (e.g., instructions executing on a processing device) , or a combination thereof. It is to be appreciated that not all steps may be needed to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 9, as will be understood by a person of ordinary skill in the art.
- Method 900 shall be described with reference to FIGs. 1 and 4. However, method 900 is not limited to that example embodiment.
- a sentence is received.
- the sentence may be part of an input text.
- 902 may be performed by input pre-processing module 102 of GEC system 100.
- one or more target words in the sentence are identified based on one or more grammatical error types. Each target word corresponds to one or more grammatical error types.
- 904 may be performed by parsing module 104 of GEC system 100.
- a classification of one target word with respect to the corresponding grammatical error type is estimated using ANN model 120 trained for the grammatical error type.
- a grammatical error is detected based on the target word and the estimated classification of the target word.
- the detection may be made by comparing the actual classification of the target word with the estimated classification of the target word.
- 906 and 908 may be performed by classification-based GEC module 108 of GEC system 100.
- method 900 moves back to 904 to process the next target word in the sentence.
- grammatical error corrections to the sentence are provided based on the grammatical error result.
- the estimated classifications of each target word may be used for generating the grammatical error corrections.
- a grammar score may be provided based on the grammatical error result as well.
- 912 may be performed by scoring/correction module 114 of GEC system 100.
- FIG. 10 is a flow chart illustrating an example of a method 1000 for classifying a target word with respect to a grammatical error type in accordance with an embodiment.
- Method 1000 can be performed by processing logic that can comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc. ) , software (e.g., instructions executing on a processing device) , or a combination thereof. It is to be appreciated that not all steps may be needed to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 10, as will be understood by a person of ordinary skill in the art.
- Method 1000 shall be described with reference to FIGs. 1 and 4. However, method 1000 is not limited to that example embodiment.
- a context vector of a target word is provided based on the context words in the sentence.
- the context words may be any number of words surrounding the target word in the sentence.
- the context words include all the words in the sentence except the target word.
- the context words include the lemma of the target word as well.
- the context vector does not include semantic features extracted from the sentence.
- 1002 may be performed by deep context representation unit 408 of classification-based GEC module 108.
- a context weight vector is provided based on the context words in the sentence.
- the context weight vector is applied to the context vector to generate a weighted context vector.
- the context weight vector may apply a respective weight to each context word in the sentence based on the distance of the context word to the target word.
- 1004 and 1006 may be performed by attention unit 412 of classification-based GEC module 108.
- a classification value of the target word with respect to the grammatical error type is provided based on the weighted context vector of the target word.
- the classification value represents one of the multiple classes associated with a grammatical error type.
- the classification value may be a probability distribution of the target word over the classes associated with the grammatical error type. 1008 may be performed by classification unit 410 of classification-based GEC module 108.
- FIG. 11 is a flow chart illustrating another example of a method 1100 for classifying a target word with respect to a grammatical error type in accordance with an embodiment.
- Method 1100 can be performed by processing logic that can comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc. ) , software (e.g., instructions executing on a processing device) , or a combination thereof. It is to be appreciated that not all steps may be needed to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 11, as will be understood by a person of ordinary skill in the art.
- the grammatical error type of a target word is determined, for example, from a plurality of predefined grammatical error types.
- the window size of the context words is determined based on the grammatical error type.
- the window size indicates the maximum number of words before the target word and the maximum number of words after the target word in the sentence to be considered as the context words.
- the window size may vary for different grammatical error types. For example, for the subject-verb agreement and verb form errors, the entire sentence may be considered as the context since these two error types usually require dependencies from context words that are far away from the target word.
- the window size may be smaller than the entire sentence, such as 3, 5, or 10 for the article error, 3, 5, or 10 for the preposition error, and 10, 15, or 20 for the noun number error.
- a set of forward word embedding vectors are generated based on the context words before the target word.
- the dimension of each forward word embedding vector may be at least 100, such as 300.
- the order in which the set of forward word embedding vectors are generated may be from the first word within the window size to the word immediately before the target word (the forward direction) .
- a set of backward word embedding vectors are generated based on the context words after the target word.
- the dimension of each backward word embedding vector may be at least 100, such as 300.
- the order in which the set of backward word embedding vectors are generated may be from the last word within the window size to the word immediately after the target word (the backward direction).
- 1102, 1104, 1106, and 1108 may be performed by initial context generation unit 406 of classification-based GEC module 108.
- a forward context vector is provided based on the set of forward word embedding vectors.
- the set of forward word embedding vectors may be fed to a recurrent neural network following the order from the forward word embedding vector of the first word within the window size to the forward word embedding vector of the word immediately before the target word (the forward direction) .
- a backward context vector is provided based on the set of backward word embedding vectors.
- the set of backward word embedding vectors may be fed to another recurrent neural network following the order from the backward word embedding vector of the last word within the window size to the backward word embedding vector of the word immediately after the target word (the backward direction) .
- a context vector is provided by concatenating the forward context vector and the backward context vector. 1110, 1112, and 1114 may be performed by deep context representation unit 408 of classification-based GEC module 108.
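- A minimal sketch of steps 1110-1114 follows, assuming GRU recurrent neural networks (as recited in claim 46), 300-dimensional embeddings, and 300 hidden units; the class name DeepContext and other implementation details are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DeepContext(nn.Module):
    """Two GRUs read the forward and backward embedding sequences; their
    final hidden states are concatenated into the context vector."""
    def __init__(self, embed_dim=300, hidden=300):
        super().__init__()
        self.fwd_rnn = nn.GRU(embed_dim, hidden, batch_first=True)
        self.bwd_rnn = nn.GRU(embed_dim, hidden, batch_first=True)

    def forward(self, fwd_embeds, bwd_embeds):
        # fwd_embeds: (batch, n_before, embed_dim), ordered from the first
        # word in the window to the word immediately before the target.
        # bwd_embeds: (batch, n_after, embed_dim), ordered from the last
        # word in the window to the word immediately after the target.
        _, h_fwd = self.fwd_rnn(fwd_embeds)  # final hidden: (1, batch, hidden)
        _, h_bwd = self.bwd_rnn(bwd_embeds)
        return torch.cat([h_fwd[-1], h_bwd[-1]], dim=-1)  # (batch, 2*hidden)

# Example: 4 words before and 6 words after the target word.
ctx = DeepContext()(torch.randn(1, 4, 300), torch.randn(1, 6, 300))
print(ctx.shape)  # torch.Size([1, 600])
```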
- a fully connected linear operation is applied to the context vector.
- an activation function of a first layer, for example, of an MLP neural network, is applied to the output of the fully connected linear operation.
- the activation function may be the rectified linear unit activation function.
- another activation function of a second layer, for example, of the MLP neural network, is applied to the output of the activation function of the first layer to generate a classification value of the target word with respect to the grammatical error type.
- Multiclass classification of the target word with respect to the grammatical error type may be performed based on the context vector by the MLP neural network in 1116, 1118, and 1120.
- 1116, 1118, and 1120 may be performed by classification unit 410 of classification-based GEC module 108.
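- A minimal sketch of the multiclass classification of 1116-1120 follows, assuming a ReLU first-layer activation and a softmax second-layer activation that yields a probability distribution over the classes; the layer sizes and the class name ClassificationHead are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    """A two-layer MLP: a fully connected linear operation on the context
    vector with a ReLU activation (1116-1118), then a second layer whose
    softmax activation yields a probability distribution over the classes
    of the grammatical error type (1120)."""
    def __init__(self, context_dim=600, hidden=300, num_classes=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(context_dim, hidden),  # fully connected linear operation
            nn.ReLU(),                       # first-layer activation
            nn.Linear(hidden, num_classes),
            nn.Softmax(dim=-1),              # second-layer activation
        )

    def forward(self, context_vector):
        return self.net(context_vector)      # classification value y'

# Example: the noun number error has 2 classes (0 = singular, 1 = plural).
probs = ClassificationHead()(torch.randn(1, 600))
print(probs.sum().item())  # ~1.0
```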
- FIG. 12 is a flow chart illustrating an example of a method 1200 for providing a grammar score in accordance with an embodiment.
- Method 1200 can be performed by processing logic that can comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc. ) , software (e.g., instructions executing on a processing device) , or a combination thereof. It is to be appreciated that not all steps may be needed to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 12, as will be understood by a person of ordinary skill in the art.
- a user factor is determined based on information about the user.
- the information includes, for example, native language, residency, education level, age, historical scores, etc.
- weights of precision and recall are determined. Precision and recall are commonly used in combination as the main evaluation measure for GEC.
- the precision P and recall R are defined as follows:
  P = |g ∩ e| / |e| ,  R = |g ∩ e| / |g|
- g is the set of gold-standard edits of two human annotators for a specific grammatical error type, and e is the corresponding set of system edits. There may be overlaps between many other grammatical error types and the verb form error type, so g may be based on the annotations of all grammatical error types when calculating the verb form error performance. Weights between precision and recall may be adjusted when combining them together as the evaluation measure. For example, F0.5, defined in Equation 5 as F0.5 = (1 + 0.5^2) · P · R / (0.5^2 · P + R) , combines both precision and recall while assigning twice as much weight to precision, which is useful when accurate feedback is more important than coverage in some embodiments.
- F_n, where n is between 0 and 1, may be applied in other examples.
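- For illustration, the sketch below computes P, R, and F_beta from edit sets following the definitions above; treating g and e as flat sets (rather than per-annotator, per-sentence edits) is a simplifying assumption.

```python
def precision_recall_f(gold_edits, system_edits, beta=0.5):
    """Compute P, R, and F_beta from a set of gold-standard edits g and a
    set of system edits e, per the definitions above. beta = 0.5 assigns
    twice as much weight to precision as to recall."""
    g, e = set(gold_edits), set(system_edits)
    p = len(g & e) / len(e) if e else 0.0
    r = len(g & e) / len(g) if g else 0.0
    if p + r == 0.0:
        return p, r, 0.0
    f = (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
    return p, r, f

# Example: 2 of 3 system edits match the 4 gold-standard edits.
print(precision_recall_f({"e1", "e2", "e3", "e4"}, {"e1", "e2", "e9"}))
# (0.666..., 0.5, 0.625)
```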
- the weights for different grammatical error types may vary as well.
- a scoring function is obtained based on the user factor and the weights.
- the scoring function may use the user factor and weights (either the same or different for different grammatical error types) as parameters.
- the grammatical error results of each target word in the sentence are received.
- a grammar score is provided based on the grammatical error results and the scoring function.
- Grammatical error results may be the variables of the scoring function, and the user factor and weights may be the parameters of the scoring function. 1202, 1204, 1206, 1208, and 1210 may be performed by scoring/correction module 114 of GEC system 100.
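- The sketch below shows one hypothetical form such a scoring function could take, with the error results as variables and the per-type weights and user factor as parameters; the linear-penalty form, the 100-point scale, and the helper name grammar_score are assumptions, not the disclosed scoring function.

```python
def grammar_score(error_results, type_weights, user_factor=1.0):
    """Hypothetical scoring function for 1208-1210: error_results maps each
    grammatical error type to the number of detected errors; type_weights
    and user_factor are the parameters of the scoring function."""
    penalty = sum(type_weights[t] * n for t, n in error_results.items())
    return max(0.0, 100.0 - user_factor * penalty)

# Example: two article errors and one noun number error; a user factor
# below 1 softens the penalty (e.g., for a less proficient learner).
score = grammar_score({"article": 2, "noun_number": 1},
                      {"article": 5.0, "noun_number": 8.0},
                      user_factor=0.8)
print(score)  # 85.6
```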
- FIG. 13 is a block diagram illustrating an ANN model training system 1300 in accordance with an embodiment.
- ANN model training system 1300 includes a model training module 1302 configured to train each ANN model 120 for a specific grammatical error type over a set of training samples 1304 based on an objective function 1306 using a training algorithm 1308.
- each training sample 1304 may be a native training sample.
- a native training sample as disclosed herein includes a sentence without a grammatical error, as opposed to a learner training sample that includes a sentence with one or more grammatical errors.
- ANN model training system 1300 can utilize the abundant native plain text corpora as training samples 1304 to more effectively and efficiently train ANN model 120.
- training samples 1304 may be obtained from a Wikipedia dump.
- training samples 1304 for ANN model training system 1300 are not limited to native training samples.
- ANN model training system 1300 may train ANN model 120 using learner training samples or the combination of native training samples and learner training samples.
- FIG. 14 is a depiction of an example of a training sample 1304 used by ANN model training system 1300 in FIG. 13.
- a training sample includes a sentence that is associated with one or more grammatical error types 1, ..., n.
- although the training sample may be a native training sample without a grammatical error, the sentence can still be associated with grammatical error types because, as described above, a particular word is associated with one or more grammatical error types, for example, based on its PoS tag.
- the sentence may be associated with, for example, the verb form and subjective agreement errors.
- One or more target words 1, ..., m may be associated with each grammatical error type.
- all the verbs in a sentence are target words with respect to the verb form or subjective agreement error in a training sample.
- For each target word, the training sample is further associated with two pieces of information: the word embedding vector set (matrix) x and the actual classification value y.
- the word embedding vector set x may be generated based on the context words of the target word in the sentence. It is to be appreciated that in some embodiments, the word embedding vector set x may be any other initial context vector set, such as a one-hot vector set.
- the actual classification value y may be one of the class labels with respect to a specific grammatical error type, such as “0” for singular and “1” for plural with respect to the noun number error.
- the training sample thus includes one or more pairs of a word embedding vector set x and an actual classification value y, each pair corresponding to a target word with respect to a grammatical error type in the sentence.
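- As an illustration of the actual classification value y for a native training sample, the class of a target noun with respect to the noun number error can be read off its part-of-speech tag in the well-formed sentence (0 = singular, 1 = plural, as noted above); the Penn Treebank tag names below are an assumption for illustration.

```python
# 0 = singular, 1 = plural, per the noun number classes described above.
# The Penn Treebank tag names are an assumption for illustration; tagging
# is assumed to be done upstream (e.g., during pre-processing and parsing).
POS_TO_NOUN_NUMBER = {"NN": 0, "NNP": 0, "NNS": 1, "NNPS": 1}

def noun_number_label(pos_tag):
    """Actual classification value y of a target noun in a native sentence."""
    return POS_TO_NOUN_NUMBER[pos_tag]

print(noun_number_label("NNS"))  # 1 (plural)
```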
- ANN model 120 includes a plurality of parameters that can be jointly adjusted by model training module 1302 when being fed with training samples 1304.
- Model training module 1302 jointly adjusts the parameters of ANN model 120 to minimize objective function 1306 over training samples 1304 using training algorithm 1308.
- the objective function for training ANN model 120 measures the differences between the estimated classification values and the actual classification values of the target words over training samples 1304.
- Training algorithm 1308 may be any suitable iterative optimization algorithm for finding the minimum of objective function 1306, including gradient descent algorithms (e.g., the stochastic gradient descent algorithm) .
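- A minimal training-loop sketch combining 1306 and 1308 follows, assuming objective function 1306 is a cross-entropy loss over the (x, y) pairs and training algorithm 1308 is plain stochastic gradient descent; the batching, epoch count, learning rate, and helper names are illustrative assumptions.

```python
import torch

def train(model, pairs, epochs=5, lr=0.01):
    """Stochastic gradient descent on a cross-entropy objective over
    (x, y) pairs, where x is the (batched) context input and y holds the
    actual class indices. All hyperparameters are illustrative."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    nll = torch.nn.NLLLoss()  # on log-probabilities, this is cross-entropy
    for _ in range(epochs):
        for x, y in pairs:
            opt.zero_grad()
            probs = model(x)                 # estimated classification y'
            loss = nll(torch.log(probs), y)  # difference between y' and y
            loss.backward()                  # gradients for all parameters
            opt.step()                       # jointly adjusts every sub-model
    return model
```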
- FIG. 15 is a flow chart illustrating an example of a method 1500 for ANN model training for grammatical error correction in accordance with an embodiment.
- Method 1500 can be performed by processing logic that can comprise hardware (e.g., circuitry, dedicated logic, programmable logic, microcode, etc. ) , software (e.g., instructions executing on a processing device) , or a combination thereof. It is to be appreciated that not all steps may be needed to perform the disclosure provided herein. Further, some of the steps may be performed simultaneously, or in a different order than shown in FIG. 15, as will be understood by a person of ordinary skill in the art.
- Method 1500 shall be described with reference to FIG. 13. However, method 1500 is not limited to that example embodiment.
- an ANN model for a grammatical error type is provided.
- the ANN model is for estimating a classification of a target word in a sentence with respect to the grammatical error type.
- the ANN model may be any of the ANN models disclosed herein, for example, the ones illustrated in FIGs. 6 and 7.
- the ANN model may include two recurrent neural networks configured to output a context vector of the target word based on at least one word before the target word and at least one word after the target word in the sentence.
- the context vector does not include a semantic feature of the sentence in the training sample.
- the ANN model may include deep context representation sub-model 602 that can be parameterized as forward recurrent neural network 606 and backward recurrent neural network 608.
- the ANN model may also include a feedforward neural network configured to output a classification value of the target word based on the context vector of the target word.
- the ANN model may include classification sub-model 604 that can be parameterized as feedforward neural network 610.
- a training sample set is obtained.
- Each training sample includes a sentence having a target word and an actual classification of the target word with respect to the grammatical error type.
- the training sample may include a word embedding matrix of the target word that includes a set of forward word embedding vectors and a set of backward word embedding vectors.
- Each forward word embedding vector is generated based on a respective context word before the target word, and each backward word embedding vector is generated based on a respective context word after the target word.
- the number of dimensions of each word embedding vector may be at least 100, such as 300.
- the parameters of the ANN model are jointly adjusted, for example, in an end-to-end fashion.
- a first set of parameters of deep context representation sub-model 602, associated with recurrent neural networks 606 and 608, are jointly adjusted with a second set of parameters of classification sub-model 604, associated with feedforward neural network 610, based on differences between the actual classifications and estimated classifications of the target words in each training sample.
- the parameters associated with forward recurrent neural network 606 are separate from the parameters associated with backward recurrent neural network 608.
- the ANN model may also include attention mechanism sub-model 702 that can be parameterized as feedforward neural network 610.
- the parameters of attention mechanism sub-model 702, associated with feedforward neural network 610, may be jointly adjusted with other parameters of the ANN model as well.
- the parameters of the ANN model are jointly adjusted to minimize the differences between the actual classifications and estimated classifications of the target words in each training sample based on objective function 1306 using training algorithm 1308. 1502, 1504, and 1506 may be performed by model training module 1302 of ANN model training system 1300.
- FIG. 16 is a schematic diagram illustrating an example of training ANN model 120 for grammatical error correction in accordance with an embodiment.
- ANN model 120 is trained over training samples 1304 with respect to a specific grammatical error type.
- Training samples 1304 may be from native text and pre-processed and parsed as described above with respect to FIG. 1.
- Each training sample 1304 includes a sentence having a target word with respect to the grammatical error type and the actual classification of the target word with respect to the grammatical error type.
- a pair including the word embedding matrix x of the target word and the actual classification value y of the target word may be obtained for each training sample 1304.
- the word embedding matrix x may include a set of forward word embedding vectors generated based on the context words before the target word and a set of backward word embedding vectors generated based on the context words after the target word.
- Training samples 1304 thus may include a plurality of (x, y) pairs.
- ANN model 120 may include a plurality of recurrent neural networks 1-n 1602 and a plurality of feedforward neural networks 1-m 1604. Each of neural networks 1602 and 1604 is associated with a set of parameters to be trained over training samples 1304 based on objective function 1306 using training algorithm 1308.
- Recurrent neural networks 1602 may include a forward recurrent neural network and a backward recurrent neural network configured to output a context vector of the target word based on the context words of the target word.
- recurrent neural networks 1602 may further include another one or more recurrent neural networks configured to generate the word embedding matrix of the target word based on the context words of the target word.
- Feedforward neural networks 1604 may include a feedforward neural network configured to output a classification value y’ of the target word based on the context vector of the target word. In some embodiments, feedforward neural networks 1604 may also include another feedforward neural network configured to output a context weight vector to be applied to the context vector. Neural networks 1602 and 1604 may be connected so that they can be jointly trained in an end-to-end fashion. In some embodiments, the context vector does not include a semantic feature of the sentence in training sample 1304.
- the word embedding matrix x of the target word in corresponding training sample 1304 may be fed into ANN model 120, passing through neural networks 1602 and 1604.
- the estimated classification value y’ may be outputted from the output layer (e.g., part of a feedforward neural network 1604) of ANN model 120.
- the estimated classification value y’ and the actual classification value y of the target word in corresponding training sample 1304 may be sent to objective function 1306, and the difference between the estimated classification value y’ and the actual classification value y may be used by objective function 1306 and training algorithm 1308 to jointly adjust each set of parameters associated with each of neural networks 1602 and 1604 in ANN model 120.
- Various embodiments can be implemented, for example, using one or more computer systems, such as computer system 1700 shown in FIG. 17.
- One or more computer systems 1700 can be used, for example, to implement method 300 of FIG. 3, method 900 of FIG. 9, method 1000 of FIG. 10, method 1100 of FIG. 11, method 1200 of FIG. 12, and method 1500 of FIG. 15.
- computer system 1700 can detect and correct grammatical errors and/or train an artificial neural network model for detecting and correcting grammatical errors, according to various embodiments.
- Computer system 1700 can be any well-known computer capable of performing the functions described herein.
- Computer system 1700 includes one or more processors (also called central processing units, or CPUs) , such as a processor 1704.
- processor 1704 is connected to a communication infrastructure or bus 1706.
- One or more processors 1704 may each be a graphics processing unit (GPU) .
- a GPU is a processor that is a specialized electronic circuit designed to process mathematically intensive applications.
- the GPU may have a parallel structure that is efficient for parallel processing of large blocks of data, such as mathematically intensive data common to computer graphics applications, images, videos, etc.
- Computer system 1700 also includes user input/output device (s) 1703, such as monitors, keyboards, pointing devices, etc., that communicate with communication infrastructure 1706 through user input/output interface (s) 1702.
- Computer system 1700 also includes a main or primary memory 1708, such as random access memory (RAM) .
- Main memory 1708 may include one or more levels of cache.
- Main memory 1708 has stored therein control logic (i.e., computer software) and/or data.
- Computer system 1700 may also include one or more secondary storage devices or memory 1710.
- Secondary memory 1710 may include, for example, a hard disk drive 1712 and/or a removable storage device or drive 1714.
- Removable storage drive 1714 may be a floppy disk drive, a magnetic tape drive, a compact disk drive, an optical storage device, a tape backup device, and/or any other storage device/drive.
- Removable storage drive 1714 may interact with a removable storage unit 1718.
- Removable storage unit 1718 includes a computer usable or readable storage device having stored thereon computer software (control logic) and/or data.
- Removable storage unit 1718 may be a floppy disk, magnetic tape, compact disk, DVD, optical storage disk, and/or any other computer data storage device.
- Removable storage drive 1714 reads from and/or writes to removable storage unit 1718 in a well-known manner.
- secondary memory 1710 may include other means, instrumentalities or other approaches for allowing computer programs and/or other instructions and/or data to be accessed by computer system 1700.
- Such means, instrumentalities or other approaches may include, for example, a removable storage unit 1722 and an interface 1720.
- the removable storage unit 1722 and the interface 1720 may include a program cartridge and cartridge interface (such as that found in video game devices) , a removable memory chip (such as an EPROM or PROM) and associated socket, a memory stick and USB port, a memory card and associated memory card slot, and/or any other removable storage unit and associated interface.
- Computer system 1700 may further include a communication or network interface 1724.
- Communication interface 1724 enables computer system 1700 to communicate and interact with any combination of remote devices, remote networks, remote entities, etc. (individually and collectively referenced by reference number 1728) .
- communication interface 1724 may allow computer system 1700 to communicate with remote devices 1728 over communication path 1726, which may be wired and/or wireless, and which may include any combination of LANs, WANs, the Internet, etc. Control logic and/or data may be transmitted to and from computer system 1700 via communication path 1726.
- a tangible apparatus or article of manufacture comprising a tangible computer useable or readable medium having control logic (software) stored thereon is also referred to herein as a computer program product or program storage device.
- The control logic, when executed by one or more data processing devices (such as computer system 1700) , causes such data processing devices to operate as described herein.
Description
Error Type | Classification Values y |
---|---|
Article | 0 = a/an, 1 = the, 2 = none |
Preposition | label = preposition index |
Verb form | 0 = base form, 1 = gerund or present participle, 2 = past participle |
Subjective agreement | 0 = non-3rd person singular present, 1 = 3rd person singular present |
Noun number | 0 = singular, 1 = plural |
Claims (70)
- A method for grammatical error detection, comprising: receiving, by at least one processor, a sentence; identifying, by the at least one processor, one or more target words in the sentence based, at least in part, on one or more grammatical error types, wherein each of the one or more target words corresponds to at least one of the one or more grammatical error types; for at least one of the one or more target words, estimating, by the at least one processor, a classification of the target word with respect to the corresponding grammatical error type using an artificial neural network model trained for the grammatical error type, wherein the model comprises (i) two recurrent neural networks configured to output a context vector of the target word based, at least in part, on at least one word before the target word and at least one word after the target word in the sentence, and (ii) a feedforward neural network configured to output a classification value of the target word with respect to the grammatical error type based, at least in part, on the context vector of the target word; and detecting, by the at least one processor, a grammatical error in the sentence based, at least in part, on the target word and the estimated classification of the target word.
- The method of claim 1, the estimating further comprising: providing the context vector of the target word based, at least in part, on the at least one word before the target word and the at least one word after the target word in the sentence using the two recurrent neural networks; and providing the classification value of the target word with respect to the grammatical error type based, at least in part, on the context vector of the target word using the feedforward neural network.
- The method of claim 2, wherein the context vector of the target word is provided based, at least in part, on a lemma of the target word.
- The method of claim 2, the estimating further comprising: generating a first set of word embedding vectors, wherein each word embedding vector in the first set of word embedding vectors is generated based, at least in part, on a respective one of the at least one word before the target word in the sentence; and generating a second set of word embedding vectors, wherein each word embedding vector in the second set of word embedding vectors is generated based, at least in part, on a respective one of the at least one word after the target word in the sentence.
- The method of claim 4, wherein the number of dimensions of each word embedding vector is at least 100.
- The method of claim 1, wherein: the at least one word before the target word comprises all words before the target word in the sentence; and the at least one word after the target word comprises all words after the target word in the sentence.
- The method of claim 1, wherein the number of the at least one word before the target word and/or the number of the at least one word after the target word are determined based, at least in part, on the grammatical error type.
- The method of claim 2, the estimating further comprising: providing a context weight vector of the target word based, at least in part, on the at least one word before the target word and the at least one word after the target word in the sentence; and applying the context weight vector to the context vector.
- The method of claim 4, the providing of the context vector further comprising: providing a first context vector of the target word based, at least in part, on the first set of word embedding vectors using a first one of the two recurrent neural networks; providing a second context vector of the target word based, at least in part, on the second set of word embedding vectors using a second one of the two recurrent neural networks; and providing the context vector by concatenating the first and second context vectors.
- The method of claim 9, wherein: the first set of word embedding vectors are provided to the first recurrent neural network starting from the word embedding vector of a word at the beginning of the sentence; and the second set of word embedding vectors are provided to the second recurrent neural network starting from the word embedding vector of a word at the end of the sentence.
- The method of claim 1, wherein the number of hidden units in each of the two recurrent neural networks is at least 300.
- The method of claim 1, wherein the feedforward neural network comprises: a first layer having a first activation function of a fully connected linear operation on the context vector; and a second layer connected to the first layer and having a second activation function for generating the classification value.
- The method of claim 1, wherein the classification value is a probability distribution of the target word over a plurality of classes associated with the grammatical error type.
- The method of claim 1, the detecting further comprising: comparing the estimated classification of the target word with an actual classification of the target word; and detecting the grammatical error in the sentence when the actual classification does not match the estimated classification of the target word.
- The method of claim 1, further comprising: in response to detecting the grammatical error in the sentence, providing a grammatical error correction of the target word based, at least in part, on the estimated classification of the target word.
- The method of claim 1, further comprising: for each of the one or more target words, estimating a respective classification of the target word with respect to the corresponding grammatical error type using a respective artificial neural network model trained for the grammatical error type, and comparing the estimated classification of the target word with an actual classification of the target word to generate a grammatical error result of the target word; applying a weight to each of the grammatical error results of the one or more target words based, at least in part, on the corresponding grammatical error type; and providing a grammar score of the sentence based on the grammatical error results of the one or more target words and the weights.
- The method of claim 16, wherein the grammar score is provided based, at least in part, on information associated with a user from whom the sentence is received.
- The method of claim 1, wherein the model is trained by native training samples.
- The method of claim 1, wherein the two recurrent neural networks and the feedforward neural network are jointly trained.
- The method of claim 1, wherein the model further comprises: another recurrent neural network configured to output a set of initial context vectors to be inputted to the two recurrent neural networks for generating the context vector; and another feedforward neural network configured to output a context weight vector to be applied to the context vector.
- The method of claim 20, wherein all the recurrent neural networks and feedforward neural networks are jointly trained by native training samples.
- A system for grammatical error detection, comprising: a memory; and at least one processor coupled to the memory and configured to: receive a sentence; identify one or more target words in the sentence based, at least in part, on one or more grammatical error types, wherein each of the one or more target words corresponds to at least one of the one or more grammatical error types; for at least one of the one or more target words, estimate a classification of the target word with respect to the corresponding grammatical error type using an artificial neural network model trained for the grammatical error type, wherein the model comprises (i) two recurrent neural networks configured to generate a context vector of the target word based, at least in part, on at least one word before the target word and at least one word after the target word in the sentence, and (ii) a feedforward neural network configured to output a classification value of the target word with respect to the grammatical error type based, at least in part, on the context vector of the target word; and detect a grammatical error in the sentence based, at least in part, on the target word and the estimated classification of the target word.
- The system of claim 22, wherein to estimate a classification of the target word the at least one processor is configured to: provide the context vector of the target word based, at least in part, on the at least one word before the target word and the at least one word after the target word in the sentence using the two recurrent neural networks; and provide the classification value of the target word with respect to the grammatical error type based, at least in part, on the context vector of the target word using the feedforward neural network.
- The system of claim 23, wherein the context vector of the target word is provided based, at least in part, on a lemma of the target word.
- The system of claim 23, wherein to estimate a classification of the target word, the at least one processor is configured to: generate a first set of word embedding vectors, wherein each word embedding vector in the first set of word embedding vectors is generated based, at least in part, on a respective one of the at least one word before the target word in the sentence; and generate a second set of word embedding vectors, wherein each word embedding vector in the second set of word embedding vectors is generated based, at least in part, on a respective one of the at least one word after the target word in the sentence.
- The system of claim 25, wherein the number of dimensions of each word embedding vector is at least 100.
- The system of claim 22, wherein: the at least one word before the target word comprises all words before the target word in the sentence; and the at least one word after the target word comprises all words after the target word in the sentence.
- The system of claim 22, wherein the number of the at least one word before the target word and/or the number of the at least one word after the target word are determined based, at least in part, on the grammatical error type.
- The system of claim 23, wherein to estimate a classification of the target word the at least one processor is configured to: provide a context weight vector of the target word based, at least in part, on the at least one word before the target word and the at least one word after the target word in the sentence; and apply the context weight vector to the context vector.
- The system of claim 25, wherein to provide the context vector of the target word the at least one processor is configured to: provide a first context vector of the target word based, at least in part, on the first set of word embedding vectors using a first one of the two recurrent neural networks; provide a second context vector of the target word based, at least in part, on the second set of word embedding vectors using a second one of the two recurrent neural networks; and provide the context vector by concatenating the first and second context vectors.
- The system of claim 30, wherein: the first set of word embedding vectors are provided to the first recurrent neural network starting from the word embedding vector of a word at the beginning of the sentence; and the second set of word embedding vectors are provided to the second recurrent neural network starting from the word embedding vector of a word at the end of the sentence.
- The system of claim 22, wherein the number of hidden units in each of the two recurrent neural networks is at least 300.
- The system of claim 22, wherein the feedforward neural network comprises: a first layer having a first activation function of a fully connected linear operation on the context vector; and a second layer connected to the first layer and having a second activation function for generating the classification value.
- The system of claim 22, wherein the classification value is a probability distribution of the target word over a plurality of classes associated with the grammatical error type.
- The system of claim 22, wherein to detect a grammatical error the at least one processor is configured to: compare the estimated classification of the target word with an actual classification of the target word; and detect the grammatical error in the sentence when the actual classification does not match the estimated classification of the target word.
- The system of claim 22, the at least one processor further configured to: in response to detecting the grammatical error in the sentence, provide a grammatical error correction of the target word based, at least in part, on the estimated classification of the target word.
- The system of claim 22, the at least one processor further configured to: for each of the one or more target words, estimate a respective classification of the target word with respect to the corresponding grammatical error type using a respective artificial neural network model trained for the grammatical error type, and compare the estimated classification of the target word with an actual classification of the target word to generate a grammatical error result of the target word; apply a weight to each of the grammatical error results of the one or more target words based, at least in part, on the corresponding grammatical error type; and provide a grammar score of the sentence based on the grammatical error results of the one or more target words and the weights.
- The system of claim 37, wherein the grammar score is provided based, at least in part, on information associated with a user from whom the sentence is received.
- The system of claim 22, wherein the model is trained by native training samples.
- The system of claim 22, wherein the two recurrent neural networks and the feedforward neural network are jointly trained.
- The system of claim 22, wherein the model further comprises: another recurrent neural network configured to output a set of initial context vectors to be inputted to the two recurrent neural networks for generating the context vector; and another feedforward neural network configured to output a context weight vector to be applied to the context vector.
- The system of claim 41, wherein all the recurrent neural networks and feedforward neural networks are jointly trained by native training samples.
- A tangible computer-readable device having instructions stored thereon that, when executed by at least one computing device, cause the at least one computing device to perform operations comprising: receiving a sentence; identifying one or more target words in the sentence based, at least in part, on one or more grammatical error types, wherein each of the one or more target words corresponds to at least one of the one or more grammatical error types; for at least one of the one or more target words, estimating a classification of the target word with respect to the corresponding grammatical error type using an artificial neural network model trained for the grammatical error type, wherein the model comprises (i) two recurrent neural networks configured to output a context vector of the target word based, at least in part, on at least one word before the target word and at least one word after the target word in the sentence, and (ii) a feedforward neural network configured to output a classification value of the target word with respect to the grammatical error type based, at least in part, on the context vector of the target word; and detecting a grammatical error in the sentence based, at least in part, on the target word and the estimated classification of the target word.
- A method for training an artificial neural network model, comprising: providing, by at least one processor, an artificial neural network model for estimating a classification of a target word in a sentence with respect to a grammatical error type, wherein the model comprises (i) two recurrent neural networks configured to output a context vector of the target word based, at least in part, on at least one word before the target word and at least one word after the target word in the sentence, and (ii) a feedforward neural network configured to output a classification value of the target word based, at least in part, on the context vector of the target word; obtaining, by the at least one processor, a set of training samples, wherein each training sample in the set of training samples comprises a sentence comprising a target word with respect to the grammatical error type and an actual classification of the target word with respect to the grammatical error type; and jointly adjusting, by the at least one processor, a first set of parameters associated with the recurrent neural networks and a second set of parameters associated with the feedforward neural network based, at least in part, on differences between the actual classifications and estimated classifications of the target words in each training sample.
- The method of claim 44, wherein each training sample is a native training sample without a grammatical error.
- The method of claim 44, wherein the recurrent neural networks are gated recurrent unit (GRU) neural networks, and the feedforward neural network is a multilayer perceptron (MLP) neural network.
- The method of claim 44, wherein the model further comprises: another feedforward neural network configured to output a context weight vector to be applied to the context vector.
- The method of claim 47, the jointly adjusting comprising: jointly adjusting the first and second sets of parameters and a third set of parameters associated with the another feedforward neural network based, at least in part, on the differences between the actual classifications and estimated classifications of the target words in each training sample.
- The method of claim 44, further comprising: for each training sample, generating a first set of word embedding vectors, wherein each word embedding vector in the first set of word embedding vectors is generated based, at least in part, on a respective one of at least one word before the target word in the training sample; and generating a second set of word embedding vectors, wherein each word embedding vector in the second set of word embedding vectors is generated based, at least in part, on a respective one of at least one word after the target word in the training sample.
- The method of claim 49, wherein the number of dimensions of each word embedding vector is at least 100.
- The method of claim 49, wherein: the at least one word before the target word comprises all words before the target word in the sentence; and the at least one word after the target word comprises all words after the target word in the sentence.
- The method of claim 49, further comprising: for each training sample, providing a first context vector of the target word based, at least in part, on the first set of word embedding vectors using a first one of the two recurrent neural networks; providing a second context vector of the target word based, at least in part, on the second set of word embedding vectors using a second one of the two recurrent neural networks; and providing the context vector by concatenating the first and second context vectors.
- The method of claim 52, wherein: the first set of word embedding vectors are provided to the first recurrent neural network starting from the word embedding vector of a word at the beginning of the sentence; and the second set of word embedding vectors are provided to the second recurrent neural network starting from the word embedding vector of a word at the end of the sentence.
- The method of claim 52, wherein the first and second context vectors do not comprise a semantic feature of the sentence in the training sample.
- The method of claim 44, wherein the number of hidden units in each of the two recurrent neural networks is at least 300.
- The method of claim 44, wherein the feedforward neural network comprises: a first layer having a first activation function of a fully connected linear operation on the context vector; and a second layer connected to the first layer and having a second activation function for generating the classification value.
- A system for training an artificial neural network model, comprising: a memory; and at least one processor coupled to the memory and configured to: provide an artificial neural network model for estimating a classification of a target word in a sentence with respect to a grammatical error type, wherein the model comprises (i) two recurrent neural networks configured to output a context vector of the target word based, at least in part, on at least one word before the target word and at least one word after the target word in the sentence, and (ii) a feedforward neural network configured to output a classification value of the target word based, at least in part, on the context vector of the target word; obtain a set of training samples, wherein each training sample in the set of training samples comprises a sentence comprising a target word with respect to the grammatical error type and an actual classification of the target word with respect to the grammatical error type; and jointly adjust a first set of parameters associated with the recurrent neural networks and a second set of parameters associated with the feedforward neural network based, at least in part, on differences between the actual classifications and estimated classifications of the target words in each training sample.
- The system of claim 57, wherein each training sample is a native training sample without a grammatical error.
- The system of claim 57, wherein the recurrent neural networks are GRU neural networks, and the feedforward neural network is an MLP neural network.
- The system of claim 57, wherein the model further comprises: another feedforward neural network configured to output a context weight vector to be applied to the context vector.
- The system of claim 60, wherein to jointly adjust a first set of parameters and a second set of parameters the at least one processor is configured to: jointly adjust the first and second sets of parameters and a third set of parameters associated with the another feedforward neural network based, at least in part, on the differences between the actual classifications and estimated classifications of the target words in each training sample.
- The system of claim 57, the at least one processor further configured to: for each training sample, generate a first set of word embedding vectors, wherein each word embedding vector in the first set of word embedding vectors is generated based, at least in part, on a respective one of at least one word before the target word in the training sample; and generate a second set of word embedding vectors, wherein each word embedding vector in the second set of word embedding vectors is generated based, at least in part, on a respective one of at least one word after the target word in the training sample.
- The system of claim 62, wherein the number of dimensions of each word embedding vector is at least 100.
- The system of claim 62, wherein: the at least one word before the target word comprises all words before the target word in the sentence; and the at least one word after the target word comprises all words after the target word in the sentence.
- The system of claim 62, the at least one processor further configured to: for each training sample, provide a first context vector of the target word based, at least in part, on the first set of word embedding vectors using a first one of the two recurrent neural networks; provide a second context vector of the target word based, at least in part, on the second set of word embedding vectors using a second one of the two recurrent neural networks; and provide the context vector by concatenating the first and second context vectors.
- The system of claim 65, wherein: the first set of word embedding vectors are provided to the first recurrent neural network starting from the word embedding vector of a word at the beginning of the sentence; and the second set of word embedding vectors are provided to the second recurrent neural network starting from the word embedding vector of a word at the end of the sentence.
- The system of claim 65, wherein the first and second context vectors do not comprise a semantic feature of the sentence in the training sample.
- The system of claim 57, wherein the number of hidden units in each of the two recurrent neural networks is at least 300.
- The system of claim 57, wherein the feedforward neural network comprises: a first layer having a first activation function of a fully connected linear operation on the context vector; and a second layer connected to the first layer and having a second activation function for generating the classification value.
- A tangible computer-readable device having instructions stored thereon that, when executed by at least one computing device, cause the at least one computing device to perform operations comprising: providing an artificial neural network model for estimating a classification of a target word in a sentence with respect to a grammatical error type, wherein the model comprises (i) two recurrent neural networks configured to output a context vector of the target word based, at least in part, on at least one word before the target word and at least one word after the target word in the sentence, and (ii) a feedforward neural network configured to output a classification value of the target word based, at least in part, on the context vector of the target word; obtaining a set of training samples, wherein each training sample in the set of training samples comprises a sentence comprising a target word with respect to the grammatical error type and an actual classification of the target word with respect to the grammatical error type; and jointly adjusting a first set of parameters associated with the recurrent neural networks and a second set of parameters associated with the feedforward neural network based, at least in part, on differences between the actual classifications and estimated classifications of the target words in each training sample.
Priority Applications (5)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR1020207005087A KR102490752B1 (en) | 2017-08-03 | 2017-08-03 | Deep context-based grammatical error correction using artificial neural networks |
CN201780094942.2A CN111226222B (en) | 2017-08-03 | 2017-08-03 | Depth context-based grammar error correction using artificial neural networks |
JP2020505241A JP7031101B2 (en) | 2017-08-03 | 2017-08-03 | Methods, systems and tangible computer readable devices |
MX2020001279A MX2020001279A (en) | 2017-08-03 | 2017-08-03 | Deep context-based grammatical error correction using artificial neural networks. |
PCT/CN2017/095841 WO2019024050A1 (en) | 2017-08-03 | 2017-08-03 | Deep context-based grammatical error correction using artificial neural networks |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2017/095841 WO2019024050A1 (en) | 2017-08-03 | 2017-08-03 | Deep context-based grammatical error correction using artificial neural networks |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2019024050A1 true WO2019024050A1 (en) | 2019-02-07 |
Family
ID=65233230
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2017/095841 WO2019024050A1 (en) | 2017-08-03 | 2017-08-03 | Deep context-based grammatical error correction using artificial neural networks |
Country Status (5)
Country | Link |
---|---|
JP (1) | JP7031101B2 (en) |
KR (1) | KR102490752B1 (en) |
CN (1) | CN111226222B (en) |
MX (1) | MX2020001279A (en) |
WO (1) | WO2019024050A1 (en) |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110210294A (en) * | 2019-04-23 | 2019-09-06 | 平安科技(深圳)有限公司 | Evaluation method, device, storage medium and the computer equipment of Optimized model |
CN110309512A (en) * | 2019-07-05 | 2019-10-08 | 北京邮电大学 | A Method of Correcting Chinese Grammar Errors Based on Generative Adversarial Networks |
CN110399607A (en) * | 2019-06-04 | 2019-11-01 | 深思考人工智能机器人科技(北京)有限公司 | A kind of conversational system text error correction system and method based on phonetic |
CN110472243A (en) * | 2019-08-08 | 2019-11-19 | 河南大学 | A method for checking Chinese spelling |
CN110797010A (en) * | 2019-10-31 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Question-answer scoring method, device, equipment and storage medium based on artificial intelligence |
CN110889284A (en) * | 2019-12-04 | 2020-03-17 | 成都中科云集信息技术有限公司 | Multi-task learning Chinese language disease diagnosis method based on bidirectional long-time and short-time memory network |
CN111310447A (en) * | 2020-03-18 | 2020-06-19 | 科大讯飞股份有限公司 | Grammar error correction method, grammar error correction device, electronic equipment and storage medium |
JP2020140183A (en) * | 2019-03-03 | 2020-09-03 | 学校法人甲南学園 | Language learning support device |
CN111914540A (en) * | 2019-05-10 | 2020-11-10 | 阿里巴巴集团控股有限公司 | Statement identification method and device, storage medium and processor |
CN111950292A (en) * | 2020-06-22 | 2020-11-17 | 北京百度网讯科技有限公司 | Text error correction model training method, text error correction processing method and device |
CN112016603A (en) * | 2020-08-18 | 2020-12-01 | 上海松鼠课堂人工智能科技有限公司 | Error analysis method based on graph neural network |
CN112380883A (en) * | 2020-12-04 | 2021-02-19 | 北京有竹居网络技术有限公司 | Model training method, machine translation method, device, equipment and storage medium |
CN112749553A (en) * | 2020-06-05 | 2021-05-04 | 腾讯科技(深圳)有限公司 | Text information processing method and device for video file and server |
US20210271810A1 (en) * | 2020-03-02 | 2021-09-02 | Grammarly Inc. | Proficiency and native language-adapted grammatical error correction |
US11176321B2 (en) | 2019-05-02 | 2021-11-16 | International Business Machines Corporation | Automated feedback in online language exercises |
WO2021260554A1 (en) * | 2020-06-22 | 2021-12-30 | Crimson AI LLP | Domain-specific grammar correction system, server and method for academic text |
CN114818713A (en) * | 2022-05-11 | 2022-07-29 | 安徽理工大学 | Chinese named entity recognition method based on boundary detection |
CN114896966A (en) * | 2022-05-17 | 2022-08-12 | 西安交通大学 | Method, system, equipment and medium for positioning grammar error of Chinese text |
EP4080399A4 (en) * | 2019-12-18 | 2022-11-23 | Fujitsu Limited | INFORMATION PROCESSING PROGRAM, INFORMATION PROCESSING METHOD AND INFORMATION PROCESSING DEVICE |
CN115544259A (en) * | 2022-11-29 | 2022-12-30 | 城云科技(中国)有限公司 | Long text classification preprocessing model and construction method, device and application thereof |
CN116306598A (en) * | 2023-05-22 | 2023-06-23 | 上海蜜度信息技术有限公司 | Customized error correction methods, systems, equipment and media for words in different fields |
CN117350283A (en) * | 2023-10-11 | 2024-01-05 | 西安栗子互娱网络科技有限公司 | Text defect detection method, device, equipment and storage medium |
JP2024500778A (en) * | 2020-12-18 | 2024-01-10 | グーグル エルエルシー | On-device grammar checking |
CN117574860A (en) * | 2024-01-16 | 2024-02-20 | 北京蜜度信息技术有限公司 | Method and equipment for text color rendering |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR102517971B1 (en) * | 2020-08-14 | 2023-04-05 | 부산대학교 산학협력단 | Context sensitive spelling error correction system or method using Autoregressive language model |
KR102379660B1 (en) * | 2020-11-30 | 2022-03-29 | 주식회사 티맥스에이아이 | Method for utilizing deep learning based semantic role analysis |
CN114580384A (en) * | 2020-12-02 | 2022-06-03 | 北大方正集团有限公司 | Method, apparatus, medium, and program for training and recognizing grammar error recognition model |
CN112597754B (en) * | 2020-12-23 | 2023-11-21 | 北京百度网讯科技有限公司 | Text error correction methods, devices, electronic equipment and readable storage media |
KR20220106331A (en) * | 2021-01-22 | 2022-07-29 | 삼성전자주식회사 | Electronic apparatus and method for controlling thereof |
JP2022164001A (en) * | 2021-04-15 | 2022-10-27 | 株式会社Nttドコモ | monolingual translator |
CN114372441B (en) * | 2022-03-23 | 2022-06-03 | 中电云数智科技有限公司 | Automatic error correction method and device for Chinese text |
CN118014083B (en) * | 2024-02-29 | 2024-09-17 | 云南联合视觉科技有限公司 | Clinical case analysis problem generation method based on multi-round prompt |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103365838A (en) * | 2013-07-24 | 2013-10-23 | 桂林电子科技大学 | Method for automatically correcting syntax errors in English composition based on multivariate features |
US20130325442A1 (en) * | 2010-09-24 | 2013-12-05 | National University Of Singapore | Methods and Systems for Automated Text Correction |
US20150309982A1 (en) * | 2012-12-13 | 2015-10-29 | Postech Academy-Industry Foundation | Grammatical error correcting system and grammatical error correcting method using the same |
CN106610930A (en) * | 2015-10-22 | 2017-05-03 | 科大讯飞股份有限公司 | Foreign language writing automatic error correction method and system |
Family Cites Families (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6964023B2 (en) * | 2001-02-05 | 2005-11-08 | International Business Machines Corporation | System and method for multi-modal focus detection, referential ambiguity resolution and mood classification using multi-modal input |
CN101739870B (en) * | 2009-12-03 | 2012-07-04 | 深圳先进技术研究院 | Interactive language learning system and method |
US8775341B1 (en) * | 2010-10-26 | 2014-07-08 | Michael Lamport Commons | Intelligent control with hierarchical stacked neural networks |
US10339920B2 (en) * | 2014-03-04 | 2019-07-02 | Amazon Technologies, Inc. | Predicting pronunciation in speech recognition |
KR102199445B1 (en) * | 2014-07-30 | 2021-01-06 | 에스케이텔레콤 주식회사 | Method and apparatus for discriminative training acoustic model based on class, and speech recognition apparatus using the same |
US10115055B2 (en) * | 2015-05-26 | 2018-10-30 | Booking.Com B.V. | Systems methods circuits and associated computer executable code for deep learning based natural language understanding |
US9552547B2 (en) * | 2015-05-29 | 2017-01-24 | Sas Institute Inc. | Normalizing electronic communications using a neural-network normalizer and a neural-network flagger |
US9595002B2 (en) * | 2015-05-29 | 2017-03-14 | Sas Institute Inc. | Normalizing electronic communications using a vector having a repeating substring as input for a neural network |
US20180260860A1 (en) * | 2015-09-23 | 2018-09-13 | Giridhari Devanathan | A computer-implemented method and system for analyzing and evaluating user reviews |
CN105845134B (en) * | 2016-06-14 | 2020-02-07 | 科大讯飞股份有限公司 | Spoken language evaluation method and system for freely reading question types |
- 2017
- 2017-08-03 JP JP2020505241A patent/JP7031101B2/en active Active
- 2017-08-03 CN CN201780094942.2A patent/CN111226222B/en active Active
- 2017-08-03 MX MX2020001279A patent/MX2020001279A/en unknown
- 2017-08-03 KR KR1020207005087A patent/KR102490752B1/en active Active
- 2017-08-03 WO PCT/CN2017/095841 patent/WO2019024050A1/en active IP Right Grant
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130325442A1 (en) * | 2010-09-24 | 2013-12-05 | National University Of Singapore | Methods and Systems for Automated Text Correction |
US20150309982A1 (en) * | 2012-12-13 | 2015-10-29 | Postech Academy-Industry Foundation | Grammatical error correcting system and grammatical error correcting method using the same |
CN103365838A (en) * | 2013-07-24 | 2013-10-23 | 桂林电子科技大学 | Method for automatically correcting syntax errors in English composition based on multivariate features |
CN106610930A (en) * | 2015-10-22 | 2017-05-03 | 科大讯飞股份有限公司 | Foreign language writing automatic error correction method and system |
Cited By (36)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2020140183A (en) * | 2019-03-03 | 2020-09-03 | 学校法人甲南学園 | Language learning support device |
CN110210294A (en) * | 2019-04-23 | 2019-09-06 | 平安科技(深圳)有限公司 | Evaluation method and device for an optimized model, storage medium, and computer equipment |
US11176321B2 (en) | 2019-05-02 | 2021-11-16 | International Business Machines Corporation | Automated feedback in online language exercises |
CN111914540A (en) * | 2019-05-10 | 2020-11-10 | 阿里巴巴集团控股有限公司 | Statement identification method and device, storage medium and processor |
CN110399607A (en) * | 2019-06-04 | 2019-11-01 | 深思考人工智能机器人科技(北京)有限公司 | Pinyin-based dialog system text error correction system and method |
CN110399607B (en) * | 2019-06-04 | 2023-04-07 | 深思考人工智能机器人科技(北京)有限公司 | Pinyin-based dialog system text error correction system and method |
CN110309512A (en) * | 2019-07-05 | 2019-10-08 | 北京邮电大学 | Method for correcting Chinese grammar errors based on generative adversarial networks |
CN110472243A (en) * | 2019-08-08 | 2019-11-19 | 河南大学 | A method for checking Chinese spelling |
CN110797010A (en) * | 2019-10-31 | 2020-02-14 | 腾讯科技(深圳)有限公司 | Question-answer scoring method, device, equipment and storage medium based on artificial intelligence |
CN110889284B (en) * | 2019-12-04 | 2023-04-07 | 成都中科云集信息技术有限公司 | Multi-task learning Chinese grammatical error diagnosis method based on bidirectional long short-term memory network |
CN110889284A (en) * | 2019-12-04 | 2020-03-17 | 成都中科云集信息技术有限公司 | Multi-task learning Chinese grammatical error diagnosis method based on bidirectional long short-term memory network |
EP4080399A4 (en) * | 2019-12-18 | 2022-11-23 | Fujitsu Limited | Information processing program, information processing method, and information processing device |
EP4220474A1 (en) * | 2019-12-18 | 2023-08-02 | Fujitsu Limited | Information processing program, information processing method, and information processing device |
US12299389B2 (en) | 2020-03-02 | 2025-05-13 | Grammarly Inc. | Proficiency and native language-adapted grammatical error correction |
US20210271810A1 (en) * | 2020-03-02 | 2021-09-02 | Grammarly Inc. | Proficiency and native language-adapted grammatical error correction |
US11886812B2 (en) * | 2020-03-02 | 2024-01-30 | Grammarly, Inc. | Proficiency and native language-adapted grammatical error correction |
CN111310447B (en) * | 2020-03-18 | 2024-02-02 | 河北省讯飞人工智能研究院 | Grammar error correction method, grammar error correction device, electronic equipment and storage medium |
CN111310447A (en) * | 2020-03-18 | 2020-06-19 | 科大讯飞股份有限公司 | Grammar error correction method, grammar error correction device, electronic equipment and storage medium |
CN112749553A (en) * | 2020-06-05 | 2021-05-04 | 腾讯科技(深圳)有限公司 | Text information processing method and device for video files, and server |
CN112749553B (en) * | 2020-06-05 | 2023-07-25 | 腾讯科技(深圳)有限公司 | Text information processing method and device for video files, and server |
CN111950292A (en) * | 2020-06-22 | 2020-11-17 | 北京百度网讯科技有限公司 | Text error correction model training method, text error correction processing method and device |
WO2021260554A1 (en) * | 2020-06-22 | 2021-12-30 | Crimson AI LLP | Domain-specific grammar correction system, server and method for academic text |
US11593557B2 (en) | 2020-06-22 | 2023-02-28 | Crimson AI LLP | Domain-specific grammar correction system, server and method for academic text |
CN111950292B (en) * | 2020-06-22 | 2023-06-27 | 北京百度网讯科技有限公司 | Text error correction model training method, text error correction processing method and device |
CN112016603A (en) * | 2020-08-18 | 2020-12-01 | 上海松鼠课堂人工智能科技有限公司 | Error analysis method based on graph neural network |
CN112380883B (en) * | 2020-12-04 | 2023-07-25 | 北京有竹居网络技术有限公司 | Model training method, machine translation method, device, equipment and storage medium |
CN112380883A (en) * | 2020-12-04 | 2021-02-19 | 北京有竹居网络技术有限公司 | Model training method, machine translation method, device, equipment and storage medium |
JP2024500778A (en) * | 2020-12-18 | 2024-01-10 | グーグル エルエルシー | On-device grammar checking |
CN114818713A (en) * | 2022-05-11 | 2022-07-29 | 安徽理工大学 | Chinese named entity recognition method based on boundary detection |
CN114896966A (en) * | 2022-05-17 | 2022-08-12 | 西安交通大学 | Method, system, equipment and medium for locating grammatical errors in Chinese text |
CN115544259B (en) * | 2022-11-29 | 2023-02-17 | 城云科技(中国)有限公司 | Long text classification preprocessing model and construction method, device and application thereof |
CN115544259A (en) * | 2022-11-29 | 2022-12-30 | 城云科技(中国)有限公司 | Long text classification preprocessing model and construction method, device and application thereof |
CN116306598A (en) * | 2023-05-22 | 2023-06-23 | 上海蜜度信息技术有限公司 | Customized error correction method, system, equipment and medium for words in different fields |
CN116306598B (en) * | 2023-05-22 | 2023-09-08 | 上海蜜度信息技术有限公司 | Customized error correction method, system, equipment and medium for words in different fields |
CN117350283A (en) * | 2023-10-11 | 2024-01-05 | 西安栗子互娱网络科技有限公司 | Text defect detection method, device, equipment and storage medium |
CN117574860A (en) * | 2024-01-16 | 2024-02-20 | 北京蜜度信息技术有限公司 | Method and equipment for text color rendering |
Also Published As
Publication number | Publication date |
---|---|
JP7031101B2 (en) | 2022-03-08 |
CN111226222B (en) | 2023-07-07 |
JP2020529666A (en) | 2020-10-08 |
MX2020001279A (en) | 2020-08-20 |
KR102490752B1 (en) | 2023-01-20 |
CN111226222A (en) | 2020-06-02 |
KR20200031154A (en) | 2020-03-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102490752B1 (en) | Deep context-based grammatical error correction using artificial neural networks | |
KR102329127B1 (en) | Apparatus and method for converting dialect into standard language | |
Yu et al. | Learning composition models for phrase embeddings | |
Yang et al. | Joint relational embeddings for knowledge-based question answering | |
US20140163951A1 (en) | Hybrid adaptation of named entity recognition | |
US20080215309A1 (en) | Extraction-Empowered machine translation | |
US11775763B2 (en) | Weakly supervised and explainable training of a machine-learning-based named-entity recognition (NER) mechanism | |
US20140365201A1 (en) | Training markov random field-based translation models using gradient ascent | |
US11941361B2 (en) | Automatically identifying multi-word expressions | |
Woodsend et al. | Text rewriting improves semantic role labeling | |
CN112668319A (en) | Vietnamese news event detection method based on Chinese information and Vietnamese syntax guidance | |
Hasan et al. | Neural clinical paraphrase generation with attention | |
US12248753B2 (en) | Bridging semantics between words and definitions via aligning word sense inventories | |
CN110991193B (en) | OpenKiwi-based translation matrix model selection system | |
Tedla et al. | Analyzing word embeddings and improving POS tagger of tigrinya | |
CN111144134B (en) | OpenKiwi-based automatic evaluation system for translation engine | |
Siddique et al. | Bilingual word embeddings for cross-lingual personality recognition using convolutional neural nets | |
Sardarov | Development and design of deep learning-based parts-of-speech tagging system for Azerbaijani language | |
Escolano Peinado | Learning multilingual and multimodal representations with language-specific encoders and decoders for machine translation | |
Tkachenko et al. | Neural morphological tagging for Estonian | |
Barkovska et al. | Automatic text translation system for artificial languages | |
Park et al. | Classification‐Based Approach for Hybridizing Statistical and Rule‐Based Machine Translation | |
Wegari et al. | Parts of speech tagging for Afaan Oromo | |
Antony et al. | Statistical method for English to Kannada transliteration | |
Nikiforova et al. | Language Models for Cloze Task Answer Generation in Russian |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 17919750 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2020505241 Country of ref document: JP Kind code of ref document: A |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
ENP | Entry into the national phase |
Ref document number: 20207005087 Country of ref document: KR Kind code of ref document: A |
|
32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205N DATED 24/03/2020) |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 17919750 Country of ref document: EP Kind code of ref document: A1 |
|
WWG | Wipo information: grant in national office |
Ref document number: MX/A/2020/001279 Country of ref document: MX |