CN115422926A - Model training method, text classification method, device, medium and electronic equipment - Google Patents

Model training method, text classification method, device, medium and electronic equipment

Info

Publication number
CN115422926A
Authority
CN
China
Prior art keywords
text
training
enhanced
abnormal
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210880495.6A
Other languages
Chinese (zh)
Inventor
李首贤
刘洋
张睿
肖科
但红卫
袁立强
刘庆生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN202210880495.6A
Publication of CN115422926A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00: Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the disclosure provide a model training method for abnormal text classification, a text classification method, a device, a medium, and electronic equipment, relating to the technical field of natural language processing. The method comprises the following steps: acquiring an initial text training set; performing information enhancement processing on the initial text training set to obtain enhanced training texts, the information enhancement processing including at least one of feature enhancement processing and data enhancement processing; and generating an enhanced text training set from the initial text training set and the enhanced training texts, the enhanced text training set being used to train an abnormal text classification model. By applying feature enhancement, data enhancement, and similar processing to the training texts used for model training, the distance between positive and negative samples is increased, so that the text classification results produced by the trained abnormal text classification model are more accurate.

Description

Model training method, text classification method, device, medium and electronic equipment
Technical Field
Embodiments of the present disclosure relate to the field of natural language processing technologies, and in particular, to a model training method for classifying abnormal texts, a text classification method, a model training apparatus for classifying abnormal texts, a text classification apparatus, a computer-readable storage medium, and an electronic device.
Background
This section is intended to provide a background or context to the embodiments of the disclosure recited in the claims. The description herein is not admitted to be prior art by inclusion in this section.
Natural Language Processing (NLP) refers to the computer processing of information such as the form, sound, and meaning of natural language, i.e., the operations and processes of inputting, outputting, recognizing, analyzing, understanding, and generating characters, words, sentences, and passages.
Natural language processing may be applied to identify abnormal text within large volumes of text information, for example in spam recognition. Email is closely tied to people's work and daily life, and mail users may receive large numbers of junk mails every day, some of which are harmful. Common harmful mails include invoice-scam mails, pornographic mails, and gambling-related mails; such mails degrade the user experience and may even cause property loss.
Disclosure of Invention
Although existing email anti-spam systems already shield users from a large volume of harmful mails, users still receive many of them: anti-spam systems receive large numbers of harmful-mail reports every day, spam filtering has not reached a level that satisfies users, and the user experience is poor.
Therefore, an improved model training method for abnormal text classification is provided, in which the training texts used for model training undergo feature enhancement, data enhancement, and similar processing to increase the distance between positive and negative samples, so that the text classification results output by the trained abnormal text classification model are more accurate.
In this context, embodiments of the present disclosure desirably provide a model training method for abnormal text classification, a model training apparatus for abnormal text classification, a computer-readable storage medium, and an electronic device.
In a first aspect of the embodiments of the present disclosure, a model training method for abnormal text classification is provided, including: acquiring an initial text training set; performing information enhancement processing on the initial text training set to obtain an enhanced training text; the information enhancement processing comprises at least one of feature enhancement processing and data enhancement processing; generating an enhanced text training set according to the initial text training set and the enhanced training text; and the enhanced text training set is used for training an abnormal text classification model.
In a second aspect of embodiments of the present disclosure, there is provided a text classification method, including: acquiring a text to be identified; inputting the text to be recognized into a pre-trained abnormal text classification model, and performing text classification processing on the text to be recognized; the abnormal text classification model is obtained based on a model training method for abnormal text classification; and determining the text classification result of the text to be recognized according to the output result of the abnormal text classification model.
In a third aspect of the embodiments of the present disclosure, there is provided a model training apparatus for abnormal text classification, including: the initial training set acquisition module is used for acquiring an initial text training set; the information enhancement processing module is used for carrying out information enhancement processing on the initial text training set to obtain an enhanced training text; the information enhancement processing comprises at least one of feature enhancement processing and data enhancement processing; the enhanced training set generating module is used for generating an enhanced text training set according to the initial text training set and the enhanced training text; and the enhanced text training set is used for training an abnormal text classification model.
In a fourth aspect of the disclosed embodiments, there is provided a text classification apparatus comprising: the text to be recognized acquisition module is used for acquiring a text to be recognized; the text classification module is used for inputting the text to be recognized into a pre-trained abnormal text classification model and performing text classification processing on the text to be recognized; the abnormal text classification model is obtained based on a model training method for abnormal text classification; and the result determining module is used for determining the text classification result of the text to be recognized according to the output result of the abnormal text classification model.
In a fifth aspect of embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored, which, when being executed by a processor, implements the model training method for abnormal text classification as described above.
In a sixth aspect of embodiments of the present disclosure, there is provided an electronic device comprising: a processor; and a memory having computer readable instructions stored thereon which, when executed by the processor, implement the model training method for abnormal text classification as described above.
According to the technical scheme of the embodiment of the disclosure, on one hand, the distance between the positive sample and the negative sample can be increased by performing feature enhancement and data enhancement processing on the initial text training set, more representations are processed from the initial text training set, and the data and the quality of the training set are improved. On the other hand, the abnormal text classification model obtained by training the enhanced text training set can better learn the respective characteristics of the positive and negative samples, so that the text classification result obtained by the abnormal text classification model is more accurate.
Drawings
The above and other objects, features and advantages of exemplary embodiments of the present disclosure will become readily apparent from the following detailed description read in conjunction with the accompanying drawings. Several embodiments of the present disclosure are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:
FIG. 1 schematically illustrates a content diagram of various types of spam;
fig. 2 schematically illustrates a schematic block diagram of a system architecture of an exemplary application scenario, in accordance with some embodiments of the present disclosure;
FIG. 3 schematically illustrates a flow diagram of a method of model training for abnormal text classification, in accordance with some embodiments of the present disclosure;
FIG. 4 schematically illustrates word-splitting processing of text in an initial text training set, according to some embodiments of the present disclosure;
FIG. 5 schematically illustrates a schematic diagram of determining syllable strings of text in an initial training set of text, according to some embodiments of the present disclosure;
FIG. 6 schematically illustrates a schematic diagram of homophone replacement according to some embodiments of the present disclosure;
FIG. 7 schematically illustrates a schematic diagram of equivalence word replacement, according to some embodiments of the present disclosure;
FIG. 8 schematically illustrates a schematic diagram of counterfeiting of compromised mail with normal mail, according to some embodiments of the present disclosure;
FIG. 9 schematically illustrates a block diagram of a model for anomaly text classification, in accordance with some embodiments of the present disclosure;
FIG. 10 schematically illustrates a flow diagram of a text classification method according to some embodiments of the present disclosure;
FIG. 11 schematically illustrates a block schematic diagram of a model training apparatus for abnormal text classification, in accordance with some embodiments of the present disclosure;
FIG. 12 schematically illustrates a schematic block diagram of a text classification apparatus according to some embodiments of the present disclosure;
FIG. 13 schematically shows a schematic diagram of a storage medium according to an example embodiment of the present disclosure; and
fig. 14 schematically shows a block diagram of an electronic device according to an exemplary embodiment of the invention.
In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.
Detailed Description
The principles and spirit of the present disclosure will be described with reference to a number of exemplary embodiments. It is understood that these embodiments are presented merely to enable those skilled in the art to better understand and to practice the disclosure, and are not intended to limit the scope of the disclosure in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be embodied as a system, apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.
According to an embodiment of the disclosure, a model training method for abnormal text classification, a text classification method, a model training device for abnormal text classification, a text classification device, a medium and an electronic device are provided.
In this context, the following terms are used. A vocabulary may be generated from a training data set: the characters or character strings that the model receives from the training data are placed in the vocabulary in one-to-one correspondence with numbers so that the computer can process them. The out-of-vocabulary (OOV) problem refers to characters encountered during deep learning model prediction that do not exist in the vocabulary (the vocabulary having been generated from the training set). A tokenizer (encoder) in deep learning can be regarded as a word segmenter that converts an input character or character string into a uniquely corresponding number recognizable by the computer. A text identifier (token_id) is the number that uniquely corresponds to each character or character string in the tokenizer.
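To make the vocabulary, token_id, and OOV mechanics above concrete, the following minimal Python sketch (an illustration only, not the patent's implementation; all names are invented) builds a character-level vocabulary from a training set and shows how an unseen character falls out of the vocabulary at prediction time:

```python
# Minimal illustration of vocabulary / token_id / OOV mechanics.
# This is a sketch for explanation, not the patent's implementation.
class CharTokenizer:
    def __init__(self, training_texts):
        # The vocabulary is generated from the training set only.
        chars = sorted({ch for text in training_texts for ch in text})
        self.vocab = {ch: i + 1 for i, ch in enumerate(chars)}  # 0 reserved for OOV

    def encode(self, text):
        # Each known character maps to its unique token_id;
        # characters never seen in training map to 0 (the OOV problem).
        return [self.vocab.get(ch, 0) for ch in text]

tokenizer = CharTokenizer(["代开发票", "正常邮件"])
print(tokenizer.encode("代开发票"))  # all characters known
print(tokenizer.encode("代开发祟"))  # last character unseen -> OOV id 0
```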
Moreover, any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.
The principles and spirit of the present disclosure are explained in detail below with reference to several representative embodiments of the present disclosure.
Summary of The Invention
Among the junk mails people currently receive, generation-invoicing mails (i.e., mails offering to issue invoices on behalf of others) are the most serious and the most numerous. Generally, such mails come in several types. Referring to fig. 1, fig. 1 (1) is a picture attached to a normal mail in which invoicing information is provided; with now-mature optical character recognition (OCR) technology, the text in the picture can be extracted and recognized. Fig. 1 (2) shows the formerly common generation-invoicing mail, whose keywords all revolve around common spam words such as "generation", "issue", and "invoice". The mails in both fig. 1 (1) and fig. 1 (2) can easily be learned and identified by a traditional deep learning model.
FIG. 1 (3) shows a modified generation-invoicing mail in which the "ticket" character is changed into a visually similar variant character; the sender attempts to evade anti-spam keyword capture by means of such glyph deformation. If the deep learning model has not met the variant character at prediction time, it treats it as an unregistered word, causing an OOV problem, and the model is likely to fail to intercept the mail. In the type of mail shown in fig. 1 (4), the sender disguises the spam as a normal electronic invoice mail, which not only raises the OOV problem of fig. 1 (3) but may also cause the model to misjudge normal invoice mails. FIG. 1 (5) is yet another form of spam, incorporating homophones and pinyin, further increasing the probability that such mail passes through an anti-spam system.
The new generation-invoicing mails of fig. 1 (3)-(5) further increase the difficulty of capture by the anti-spam system and present new challenges to the model: the spam system cannot effectively identify and correctly classify such mails.
Based on the above, the basic idea of the present disclosure is to obtain an initial text training set; performing information enhancement processing on the initial text training set to obtain an enhanced training text; the information enhancement processing includes at least one of feature enhancement processing and data enhancement processing; generating an enhanced text training set according to the initial text training set and the enhanced training text; the enhanced text training set is used for training an abnormal text classification model. According to the method, the distance between the positive sample and the negative sample is increased by processing the training text adopted by the model training, such as feature enhancement and data enhancement, so that the text classification result output by the abnormal text classification model obtained by training is more accurate.
Having described the general principles of the present disclosure, various non-limiting embodiments of the present disclosure are described in detail below.
Application scene overview
Referring first to fig. 2, fig. 2 is a schematic block diagram illustrating a system architecture of an exemplary application scenario to which a model training method and apparatus for abnormal text classification according to an embodiment of the present disclosure may be applied.
As shown in fig. 2, the system architecture 200 may include one or more of terminal devices 201, 202, 203, a network 204, and a server 205. The network 204 serves as a medium for providing communication links between the terminal devices 201, 202, 203 and the server 205. Network 204 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few. The terminal devices 201, 202, 203 may be various electronic devices having a display screen, including but not limited to desktop computers, portable computers, smart phones, tablet computers, and the like. It should be understood that the number of terminal devices, networks, and servers in fig. 2 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, the server 205 may be a server cluster composed of a plurality of servers.
The model training method and the text classification method for abnormal text classification provided by the embodiments of the present disclosure are generally executed by the server 205, and accordingly the model training device and the text classification device for abnormal text classification are generally disposed in the server 205. However, as those skilled in the art will readily understand, both methods may also be executed by the terminal devices 201, 202, and 203, in which case the corresponding devices may likewise be disposed in those terminal devices; this exemplary embodiment imposes no particular limitation. For example, in one exemplary embodiment, a worker uploads a text to be recognized to the server 205 through a terminal device 201, 202, or 203; the server inputs the text to be recognized into the abnormal text classification model using the text classification method provided by the embodiments of the present disclosure, the model possibly having been trained by the server 205; and after the text classification result is determined, texts recognized as normal are transmitted to the terminal devices 201, 202, and 203 for display to the user, while texts recognized as abnormal are filtered out.
It should be understood that the application scenario illustrated in fig. 2 is only one example in which embodiments of the present disclosure may be implemented. The application scope of the embodiments of the present disclosure is not limited in any way by the application scenario.
Exemplary method
In conjunction with the application scenario of fig. 2, a model training method for abnormal text classification according to an exemplary embodiment of the present disclosure is described below with reference to fig. 3. It should be noted that the above application scenarios are merely illustrated for the convenience of understanding the spirit and principles of the present disclosure, and the embodiments of the present disclosure are not limited in this respect. Rather, embodiments of the present disclosure may be applied to any scenario where applicable.
The present disclosure first provides a model training method for classifying abnormal texts, where an execution subject of the method may be a terminal device or a server, and the present disclosure is not particularly limited to this, and in this example embodiment, the server executes the method as an example for description.
Referring to fig. 3, the model training method for abnormal text classification may include the following steps S310 to S330:
step S310, an initial text training set is obtained.
In some example embodiments, the initial training set of text may include a set of training samples determined based on the historical received abnormal text.
Before model training, an initial text training set may be obtained; it may include a text sample set formed from historically received text content. To give the model a better classification effect, information enhancement processing can be performed on the basis of the determined initial text training set. For example, in an anti-spam scenario where the training goal of the model is to identify mails containing generation-invoicing content, the initial text training set may be formed from the normal mails previously received by the anti-spam system and the abnormal mails containing generation-invoicing content. Since the target of anti-spam identification is the spam itself, abnormal mails containing generation-invoicing content may be taken as positive samples, while mails containing normal invoice content and mails with other content are taken as negative samples.
Step S320, performing information enhancement processing on the initial text training set to obtain an enhanced training text; the information enhancement processing includes at least one of feature enhancement processing and data enhancement processing.
In some example embodiments, information enhancement processing derives more representations from the initial text training set without substantially increasing its data volume, raising the data quantity and quality of the initial text training set to approximate the value that a larger data volume would bring. The principle of information enhancement is to fuse prior knowledge into the initial text training set so as to derive more data representations, which helps the model discriminate statistical noise in the data and reduces model overfitting. An enhanced training text is a training text obtained after information enhancement processing of the initial text training set. Feature enhancement processing may split the glyphs or syllables of the texts in the initial text training set into component characters or syllables, thereby enhancing the data features of the initial text training set. Data enhancement processing may enhance the data features of the initial text training set by determining syllable-replacement words or structure-replacement words for it.
After the initial text training set is obtained, information enhancement processing can be performed on it. Generally, the information enhancement processing may include one or more processing modes, such as feature enhancement processing and data enhancement processing. The initial text training set may include positive and negative samples; the positive samples are subjected to feature enhancement processing, and part of the negative samples are subjected to data enhancement processing. By performing information enhancement processing on the initial text training set, more text representations can be derived from it without increasing the number of training texts, yielding the corresponding enhanced training texts.
Step S330, generating an enhanced text training set according to the initial text training set and the enhanced training text; the enhanced text training set is used for training an abnormal text classification model.
In some example embodiments, the enhanced text training set may be a training text set composed of the initial text training set and the enhanced training text together. The abnormal text classification model may be a network model for performing normal text and abnormal text classification.
And performing information enhancement processing on the initial text training set, wherein the obtained enhanced training text can contain more text features in the initial text training set. After the enhanced training text is obtained, an enhanced text training set can be generated based on the initial text training set and the enhanced training text, and more text features are extracted from the obtained enhanced text training set under the condition that the number of the training texts is not increased. After the enhanced text training set is obtained, the enhanced text training set can be used as a training data set of the text classification model to perform model training, so that a final abnormal text classification model is obtained.
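As a rough sketch of steps S310 to S330, the following assumes label 1 for abnormal (positive) samples and 0 for normal (negative) samples, with placeholder functions `feature_enhance` and `data_enhance` standing for the processes detailed later (word splitting, syllable extraction, homophone and equivalent-word replacement); none of these names come from the patent:

```python
# Sketch of steps S310-S330: derive enhanced training texts and merge them
# with the initial training set. Labels: 1 = abnormal (positive), 0 = normal.
def build_enhanced_training_set(initial_set, feature_enhance, data_enhance):
    enhanced = []
    for text, label in initial_set:
        if label == 1:
            # Positive samples get feature enhancement (splitting, syllables).
            enhanced.append((feature_enhance(text), 1))
        else:
            # Per the disclosure, a normal text with words replaced is
            # relabeled as a positive (junk) sample; in practice only part
            # of the negatives (e.g., normal invoice mails) are processed.
            enhanced.append((data_enhance(text), 1))
    # The enhanced text training set = initial set + enhanced texts.
    return initial_set + enhanced
```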
In the model training method for abnormal text classification provided by the present exemplary embodiment, on one hand, feature enhancement and data enhancement processing are performed on the initial text training set, so that the distance between positive and negative samples can be increased, more representations are processed from the initial text training set, and the data and quality of the training set are improved. On the other hand, the abnormal text classification model obtained by training the enhanced text training set can better learn the respective characteristics of the positive and negative samples, so that the text classification result obtained by the abnormal text classification model is more accurate.
Next, the above steps of the model training method for classifying abnormal texts according to the present exemplary embodiment will be described in more detail.
In one embodiment of the present disclosure, a first training text is determined based on an initial training set of texts; the first training text is a training text consisting of abnormal texts; performing feature enhancement processing on the first training text to obtain a feature enhancement training text; determining a second training text based on the initial text training set; the second training text is a training sample opposite to the first training text; and performing data enhancement processing on the second training text to obtain a data enhancement training text.
The feature enhancement training text may be a text obtained by performing feature enhancement processing on an initial text training set. The data enhancement training text can be a text obtained by performing data enhancement processing on the initial text training set. The first training text may be a training text composed of target texts, for example, in text classification recognition, the first training text may be a training text composed of abnormal texts. The abnormal text can be a target sample, for example, in a hazardous mail interception scenario, the abnormal text can be text contained in a hazardous mail, and the training target of the abnormal text classification model is to identify the first training text. The second training text may be an inverse sample of the first training text, and the training goal of the abnormal text classification model is that the abnormal text classification model also needs to be able to distinguish which part of the text content is not the content of the spam.
After the initial text training set is obtained, samples in the initial text training set can be classified. Taking an invoice dangerous mail scene as an example, in order to identify a dangerous mail with "generation invoicing" content, a mail with relevant content such as "generation invoicing" in the initial text training set may be used as the first training text. And taking the normal invoice mail content contained in the initial text training set as a second training text. The second training text is typically the inverse of the first training text.
And for the determined first training text, performing enhancement processing by adopting a feature enhancement processing mode to obtain a feature enhancement training text. And for the determined second training text, enhancement processing can be performed by adopting a data enhancement processing mode to obtain a data enhancement training text, so that the data volume and the quality of the training set are effectively improved.
It will be readily understood by those skilled in the art that in other application scenarios, the first training text and the second training text may be classified according to specific actual scene types, for example, the applicable scenarios of the present disclosure also include bad comment recognition scenarios, sensitive content recognition scenarios, and the like, and the present disclosure does not make any particular limitation thereto. In general, the first training text may be a target text recognized by the abnormal text classification model, and the second training text may be an opposite sample of the first training text. For example, in a scene of bad speech recognition in the network, the first training text may be a target text to be recognized by the abnormal text classification model, and the second training text may be a normal comment text. In the following, the present disclosure will be described by taking an example of recognizing text content including "invoicing for a substitute" and the like.
In one embodiment of the disclosure, a word-splitting dictionary is obtained, the word-splitting dictionary comprising text word-splitting rules; word-splitting processing is performed on the first training text according to the text word-splitting rules to obtain a plurality of split-character texts; and the obtained split-character texts are added to the word-splitting enhanced training text.
The word-splitting dictionary may be a dictionary composed of the rules used for splitting characters. A text word-splitting rule may be a rule for decomposing a character, such as adding or removing strokes, splitting components, or rearranging the glyph structure. Word-splitting processing may be the process of splitting the characters in the initial text training set. A split-character text may be a text obtained by splitting the characters in the initial text training set.
In dangerous mails such as generation-invoicing mails, senders sometimes replace the normal text of the mail using same-structure characters, traditional-form characters, and variant characters. For example, taking the "ticket" character as the initial training text, and continuing to refer to fig. 1 (3) and (4), the sender can change the "ticket" character into various visually similar variant characters, all of which can be reduced back to the "ticket" character through word-splitting processing. The same holds for other characters commonly used in this kind of dangerous mail, such as "open", "micro", and "true". For these confusing combinations, word splitting recovers the corresponding simplest character, and that simplest character carries the semantics the sender really wants to express.
To enable the model to effectively recognize abnormal text, this embodiment may apply word-splitting processing as feature enhancement to the texts in the initial text training set. Referring to FIG. 4, FIG. 4 schematically illustrates word-splitting processing of text in an initial text training set according to some embodiments of the present disclosure. The content in fig. 4 is a word-splitting display: the word-splitting processing of fig. 4 can identify the various same-structure variants of "ticket" appearing in mail text, including the variant characters that appear in FIG. 1 (4).
In addition, when word-splitting the texts in the initial text training set, the granularity of the splitting can be controlled. Some characters are already minimal meaningful units, such as the two "number" characters in fig. 4; the probability that the sender replaces these characters is generally considered low, so they are treated as minimum-granularity entries in the word-splitting dictionary. Because only the minimum-granularity characters obtained after splitting need to be present in the vocabulary, the OOV problem no longer arises: a single "ticket" character can represent all of its variants, so the word-splitting method effectively improves the recognition rate of the model.
After the positive sample texts in the generation-invoicing scenario have been split according to the text word-splitting rules in the word-splitting dictionary, the characters obtained from the splitting can be placed in the vocabulary, and a unique word-splitting character identifier (token_id) is configured for each character.
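A minimal sketch of the word-splitting idea follows. The decomposition rules shown (e.g., splitting one character into its component characters) are ordinary structural decompositions used purely for illustration; the real word-splitting dictionary of this embodiment is not reproduced here:

```python
# Illustrative word-splitting: reduce characters to minimum-granularity
# components using a (hypothetical) word-splitting dictionary.
SPLIT_RULES = {
    "明": ["日", "月"],        # illustrative structural decomposition
    "骉": ["马", "马", "马"],  # illustrative: three identical components
}

def split_to_min_granularity(text, rules=SPLIT_RULES):
    out = []
    for ch in text:
        if ch in rules:
            # Recurse in case a component can itself be split further.
            for part in rules[ch]:
                out.extend(split_to_min_granularity(part, rules))
        else:
            out.append(ch)  # minimum granularity: goes into the vocabulary
    return out

print(split_to_min_granularity("明骉"))  # ['日', '月', '马', '马', '马']
```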
In one embodiment of the present disclosure, a syllable character string corresponding to a first training sample is determined; the syllable character string comprises a complete syllable character string and a partial syllable character string; syllable-enhanced training text is generated based on the complete syllable string and the partial syllable string.
A syllable string may be a character string formed from the syllables corresponding to a text in the initial text training set. A complete syllable string may be the string corresponding to all the syllables that make up a character. A partial syllable string may be the string corresponding to part of the syllables that make up a character, such as its initial letter. A syllable-enhanced training text may be a training text composed of the syllable strings of the initial text training set.
In dangerous mails such as generation-invoicing mails, the sender may also use the pinyin character string or the initial-letter abbreviation of a word to express its meaning. Continuing to refer to fig. 1 (5), the mail content in fig. 1 (5) shows the pinyin "piao", and two characters shown in small superscript indicate that the recipient is being instructed to add an instant-messaging software account through which the invoicing is handled. To address this problem, the present disclosure proposes determining the syllable character strings (also called pinyin character strings) corresponding to the characters.
Referring to fig. 5, fig. 5 schematically illustrates determining the syllable strings of text in an initial text training set according to some embodiments of the present disclosure. In fig. 5, the syllable string corresponding to the "ticket" character may be a complete syllable string, i.e., "piao" can be generated directly from the "ticket" character. In addition, both the complete syllable string "piao" and the partial syllable string corresponding to "ticket" (such as the initial letter "p") are encoded directly into the vocabulary, and the encoded syllable strings are input into the model as training text. As another example, the initial letter of the "region" character ("qu") after splitting is "q", which is covered by the instant-messaging software entity "qq" encoded in the vocabulary, further improving the processing capability of the model.
Syllable strings corresponding to the positive sample texts are extracted, including whole pinyin strings and syllable parts such as initial letters; the split characters and the extracted syllable strings are placed in the vocabulary, and a unique syllable character identifier (token_id) is configured for each syllable string.
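A sketch of extracting complete and partial syllable strings is shown below; it assumes the third-party pypinyin package, which the patent does not name:

```python
# Sketch of syllable-string extraction, assuming the pypinyin package
# (pip install pypinyin); the patent does not specify a pinyin library.
from pypinyin import Style, lazy_pinyin

def syllable_strings(text):
    complete = lazy_pinyin(text)                           # full syllables
    partial = lazy_pinyin(text, style=Style.FIRST_LETTER)  # initial letters
    return complete, partial

complete, partial = syllable_strings("票")
print(complete, partial)  # ['piao'] ['p'] -- each string gets its own token_id
```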
Further, regarding harmful mails, as shown in fig. 1 (4), a large number of harmful mails are sent to users embedded in and disguised as normal mails: most of the text in such a mail is normal and only a small part is harmful information, so the model easily misjudges, for example treating a normal mail as a harmful one. The mail in fig. 1 (5) is likewise embedded in a normal electronic invoice mail.
To address model misjudgment, the distance between normal mails and junk mails needs to be increased when determining the training texts. The present disclosure therefore introduces data enhancement processing modes, namely homophone replacement and equivalent-word replacement. Taking the electronic invoice as an example, these two data enhancement methods apply homophone replacement and equivalent-word replacement to normal electronic invoices, and the newly generated samples are then taken as positive samples.
Before the data enhancement processing is performed on the initial text training set, the text in the initial text training set may be placed in a word list after word segmentation processing, and a corresponding word segmentation text identifier (token _ id) is configured.
In one embodiment of the disclosure, homophonic characters to be replaced contained in the second training text are determined, and replaced homophonic characters corresponding to the homophonic characters to be replaced are obtained; and replacing the homophone characters to be replaced by homophones to obtain homophone character replacement training texts.
The homophonic character to be replaced can be a character contained in the second training text and having the same syllable with other characters. The replacement homophonic words may be words for replacing homophonic words to be replaced in the second training text. The homophone word replacement training text can be obtained by replacing homophone words to be replaced in the second training text with replacement homophone words.
To enable the model to better distinguish junk mails disguised as normal mails and to widen the gap between normal mails and junk mails, homophone replacement can be used; this data enhancement method converts common words into words suspected of being spam. Referring to fig. 6, fig. 6 schematically illustrates homophone replacement according to some embodiments of the present disclosure. As shown in fig. 6, a character can be replaced by various characters sharing its pronunciation; for example, the "invoice" characters can be actively replaced by homophonic characters, which amounts to using normal mails for adversarial training. The model can then not only recognize texts that the feature construction above cannot handle, but also widen its margin of judgment on normal text during training.
A common use of this kind of replacement in abnormal text classification is to treat the sample obtained after replacement as the same class as the original sample; for example, if several characters of a sentence are replaced with traditional-form characters, the replaced sample keeps the original sample's class, which enhances the generalization of the original sample. In the present disclosure, however, the replaced samples are used as opposite samples: a replaced normal-invoice text is labeled with the opposite class, i.e., as a junk-invoice (positive) sample. This further sharpens the model's ability to distinguish normal mails and prevents normal mails from being judged harmful, without causing overfitting.
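The following sketch illustrates homophone replacement with the relabeling described above; the homophone table is a tiny invented example (a real table would be derived from a pinyin dictionary):

```python
import random

# Tiny illustrative homophone table: characters sharing a syllable
# (tones ignored). A real table would be built from a pinyin dictionary.
HOMOPHONES = {
    "发": ["法"],        # fa
    "票": ["漂", "飘"],  # piao
}

def homophone_replace(text, table=HOMOPHONES, prob=0.5, seed=0):
    rng = random.Random(seed)
    replaced = "".join(
        rng.choice(table[ch]) if ch in table and rng.random() < prob else ch
        for ch in text
    )
    # Per the disclosure, the replaced normal text is labeled as a
    # positive (junk) sample, opposite to the original classification.
    return replaced, 1

print(homophone_replace("电子发票"))
```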
In one embodiment of the disclosure, equivalent characters to be replaced contained in the second training text are determined; acquiring a replacement equivalent word corresponding to the equivalent word to be replaced; the replacement equivalent words are characters which have the same meaning and different expression forms with the equivalent words to be replaced; and replacing the equivalent words to be replaced by using the replacement equivalent words to obtain equivalent word replacement training texts.
The equivalent words to be replaced can be words which can be represented by other different expression forms in the second training text. The replacement equivalent word may be a word for replacing the equivalent word to be replaced in the second training text. The equivalent word replacement training text may be a training text obtained by replacing equivalent words to be replaced in the second training text with replacement equivalent words.
Equivalent-word replacement is likewise an important processing mode: it replaces related text content in the second training text with characters that have the same meaning but a different written form, including synonyms and traditional-form characters. Traditional-form characters count as equivalent-word replacements because the substructure of some traditional characters does not include the original character, so equivalent-word replacement differs in meaning from splitting with a word-splitting dictionary. Referring to fig. 7, fig. 7 schematically illustrates equivalent-word replacement according to some embodiments of the present disclosure. In fig. 7, the digit "1" can be replaced by "one" or "(1)", and a character can be replaced by its traditional form; the samples obtained after replacement are used for identifying dangerous junk mail.
In the case of lottery-type dangerous mails, the anti-spam system receives a large number of forged dangerous mails in which the sender mixes harmful information in by tampering with the sender information or the subject. Referring to fig. 8, fig. 8 schematically illustrates a dangerous mail counterfeited with a normal mail according to some embodiments of the present disclosure. In fig. 8, the sender mixes lottery information into a normal system notification and injects the sender's contact information into a normal mail; the model needs to recognize this information and classify the mail as a lottery-type harmful mail without misjudging genuine notification mails. Here, equivalent-word replacement is used to perform data enhancement processing on the training texts.
The two processing modes of homophone replacement and equivalent-word replacement can both be applied to normal electronic invoices to generate corresponding positive samples. Through this processing, a homophone-replacement sample and an equivalent-word-replacement sample can be randomly generated for each electronic invoice text. After the corresponding positive samples are generated, word-splitting processing can be performed on them, the resulting characters placed in the vocabulary, and a corresponding token_id configured.
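A matching sketch for equivalent-word replacement follows; the table entries (digit variants, a traditional form) mirror the kinds of substitutions in fig. 7 but are otherwise invented:

```python
import random

# Illustrative equivalent-word table: same meaning, different written form
# (digit variants, traditional forms). Entries are examples only.
EQUIVALENTS = {
    "1": ["一", "(1)", "①"],
    "发": ["發"],  # traditional form of the same character
}

def equivalent_replace(text, table=EQUIVALENTS, seed=0):
    rng = random.Random(seed)
    return "".join(rng.choice(table[ch]) if ch in table else ch for ch in text)

print(equivalent_replace("电子发票1张"))  # replaced sample used as a positive sample
```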
In one embodiment of the disclosure, the enhanced text training set is input to the initial text classification model to perform model training on the initial text classification model to obtain a trained abnormal text classification model; the initial text classification model comprises a feature enhancement layer, a data enhancement layer and an overlay layer; the feature enhancement layer is used for extracting features of the feature enhancement training text to generate feature enhancement text vectors; the data enhancement layer is used for extracting the characteristics of the initial text training set and the data enhancement training text to generate a data enhancement text vector; the overlay layer is used for generating an overlay vector based on the feature enhanced text vector and the data enhanced text vector.
Wherein, the initial text classification model can be a pre-constructed deep learning model. The abnormal text classification model may be a model obtained by performing model training processing on the initial text classification model. The feature enhancement layer may be a vector representation layer for extracting text features in the feature enhanced training text. The feature-enhanced text vector may be a text vector obtained by performing feature extraction and vector representation on the feature-enhanced training text by using a feature enhancement layer.
The data enhancement layer may be a vector representation layer for text feature extraction of the initial training text in the initial text training set and the data enhancement training text. The data enhancement text vector may be a text vector obtained by performing feature extraction and vector representation on the initial training text and the data enhancement training text in the initial text training set by using the data enhancement layer. The overlay layer may be a presentation layer for vector overlay processing of the feature enhanced text vector and the data enhanced text vector. The superimposed vector may be a text vector obtained by superimposing the feature-enhanced text vector and the data-enhanced text vector.
Referring to fig. 9, fig. 9 schematically illustrates a model structure for abnormal text classification according to some embodiments of the present disclosure. To achieve a better effect during model training, three vector representation layers (Embedding layers) are defined in the model structure: a feature enhancement layer, a data enhancement layer, and an overlay layer.
And acquiring a feature enhancement training text obtained through feature enhancement processing, inputting the feature enhancement training text into a feature enhancement layer, and extracting text features of the feature enhancement training text by the feature enhancement layer to obtain a feature enhancement text vector. And inputting the texts in the initial training text set and the data enhancement training texts into a data enhancement layer, and extracting text characteristics of the texts in the initial training text set and the data enhancement training texts by the data enhancement layer to obtain corresponding data enhancement text vectors.
After the feature enhanced text vector and the data enhanced text vector are obtained, the two vectors can be overlapped through an overlapping layer, and an overlapping vector is obtained. And training the initial text classification model according to the obtained text vectors, so that different Embedding layers in the model learn corresponding text features, and finally obtaining the abnormal text classification model.
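A minimal PyTorch sketch of this three-Embedding-layer structure is given below. The shared vocabulary, embedding size, GRU encoder, and classification head are all assumptions made for illustration; the patent specifies only the feature enhancement, data enhancement, and overlay layers themselves:

```python
import torch
import torch.nn as nn

class AbnormalTextClassifier(nn.Module):
    """Sketch: feature-enhancement and data-enhancement Embedding layers
    plus an overlay layer that sums their outputs. Encoder/head are assumed."""

    def __init__(self, vocab_size, dim=128, num_classes=2):
        super().__init__()
        self.feature_embed = nn.Embedding(vocab_size, dim)  # feature-enhanced texts
        self.data_embed = nn.Embedding(vocab_size, dim)     # initial + data-enhanced texts
        self.encoder = nn.GRU(dim, dim, batch_first=True)   # assumed encoder
        self.head = nn.Linear(dim, num_classes)             # assumed classifier head

    def forward(self, feature_ids, data_ids):
        # Overlay layer: element-wise sum of the two embedding views
        # (inputs assumed padded to the same length).
        overlay = self.feature_embed(feature_ids) + self.data_embed(data_ids)
        _, hidden = self.encoder(overlay)
        return self.head(hidden[-1])

model = AbnormalTextClassifier(vocab_size=10000)
logits = model(torch.zeros(2, 16, dtype=torch.long),
               torch.zeros(2, 16, dtype=torch.long))
print(logits.shape)  # torch.Size([2, 2])
```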
In addition, the present disclosure also provides a text classification method, as shown in fig. 10, the text classification method may include the following steps S1010 to S1030:
and step S1010, acquiring the text to be recognized.
In some example embodiments, the text to be recognized may be text to be subjected to an abnormal text classification recognition process.
After the abnormal text classification model is obtained through training, text classification processing can be carried out based on the abnormal text classification model. Before text classification, a text to be recognized may be obtained, and the text to be recognized may be text content determined directly from a received email. When the received mail content contains an image, the text to be recognized may also be the text content obtained by performing OCR processing on the image in the mail. The determination mode of the text to be recognized is not limited in any way.
Step S1020, inputting the text to be recognized into a pre-trained abnormal text classification model, and performing text classification processing on the text to be recognized; the abnormal text classification model is obtained based on a model training method for abnormal text classification.
In some example embodiments, the text classification process may be a process of classifying text to be recognized.
After the text to be recognized is obtained, the text to be recognized can be input into a pre-trained abnormal text classification model, the abnormal text classification model is used for classifying the text to be recognized, and content features contained in normal mails and abnormal mails can be extracted by the abnormal text classification model, so that the abnormal text classification model can effectively classify the mail content.
And step S1030, determining a text classification result of the text to be recognized according to the output result of the abnormal text classification model.
In some example embodiments, the text classification result may be a classification result obtained by performing classification processing on the text to be recognized.
After receiving the text to be recognized, the abnormal text classification model may classify the text to be recognized and obtain a corresponding output result, for example, an output result of the text to be recognized belonging to a positive classification or a negative classification. When the output result of the model is positive classification, the text classification result is considered as spam; and when the output result of the model is negative classification, the text classification result is considered as a normal mail.
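Steps S1010 to S1030 can be sketched end to end as follows, reusing the classifier and tokenizer sketched earlier; feeding the same token ids to both embedding views at inference time and reading the class via argmax are assumed details:

```python
import torch

# Sketch of steps S1010-S1030: acquire the text, run it through the trained
# abnormal text classification model, and determine the classification result.
def classify_text(model, tokenizer, text):
    ids = torch.tensor([tokenizer.encode(text)])  # S1010: text to be recognized
    with torch.no_grad():
        logits = model(ids, ids)                  # S1020: classification processing
    label = logits.argmax(dim=-1).item()          # S1030: determine the result
    return "spam" if label == 1 else "normal"
```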
Referring to table 1, table 1 shows the classification effect of a model trained without any information enhancement processing. Referring to table 2, table 2 shows the classification effect of the abnormal text classification model of the present disclosure.
TABLE 1

Original label     Mispredicted label   Count   Error rate
Others             Garbage invoice      1       0.001
Garbage invoice    Others               7       0.003
Garbage invoice    Normal invoice       3       0.001
Normal invoice     Garbage invoice      30      0.027
Normal invoice     Others               4       0.004
TABLE 2

Original label     Mispredicted label   Count   Error rate
Garbage invoice    Others               5       0.002
Garbage invoice    Normal invoice       2       0.001
Normal invoice     Garbage invoice      5       0.005
As can be seen from tables 1 and 2, the abnormal text classification model obtained after feature enhancement and data enhancement training greatly reduces the confusion among the normal invoice, garbage invoice, and other categories. In addition, judging from the daily report data for invoice mail, the number of reports fell from 168.71 to 30.29 per day, a reduction of 82.05%; the effect is significant.
On the one hand, in the text classification method provided by this exemplary embodiment, the abnormal text classification model is trained on a training sample set that has undergone information enhancement processing, so the model can extract the respective features of positive and negative samples, improving the accuracy of the text classification results and reducing misjudgment. On the other hand, harmful mails can to a certain extent be prevented from being judged as normal mails, and combining the two enhancement modes can further improve the recognition of harmful mails and reduce their spread.
Exemplary devices
Having described the method of the exemplary embodiment of the present disclosure, next, a model training apparatus for abnormal text classification of an exemplary embodiment of the present disclosure will be described with reference to fig. 11.
In fig. 11, a model training apparatus 1100 for abnormal text classification may include: an initial training set acquisition module 1110, an information enhancement processing module 1120, and an enhanced training set generation module 1130.
The initial training set obtaining module 1110 is configured to obtain an initial text training set; the information enhancement processing module 1120 is used for performing information enhancement processing on the initial text training set to obtain an enhanced training text; the information enhancement processing includes at least one of feature enhancement processing and data enhancement processing; an enhanced training set generating module 1130, configured to generate an enhanced text training set according to the initial text training set and the enhanced training text; the enhanced text training set is used for training an abnormal text classification model.
In one embodiment of the present disclosure, the enhanced training texts comprise feature enhanced training texts and data enhanced training texts; the information enhancement processing module 1120 includes an information enhancement processing unit for: determining a first training text based on an initial text training set; the first training text is a training text consisting of abnormal texts; performing feature enhancement processing on the first training text to obtain a feature enhancement training text; determining a second training text based on the initial text training set; the second training text is a training sample opposite to the first training text; and performing data enhancement processing on the second training text to obtain a data enhancement training text.
In one embodiment of the present disclosure, the feature-enhanced training text comprises word-splitting enhanced training text. The information enhancement processing unit includes a first feature enhancement subunit configured to: acquire a word-splitting dictionary, the word-splitting dictionary comprising text word-splitting rules; perform word-splitting processing on the first training text according to the text word-splitting rules to obtain a plurality of split-word texts; and add the obtained split-word texts to the word-splitting enhanced training text.
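A minimal sketch of the word-splitting step is given below. The dictionary entries are hypothetical character-decomposition rules (a real word-splitting dictionary would be far larger), and emitting a single split variant per text is a simplification of "a plurality of split-word texts".

```python
# Hypothetical word-splitting dictionary: character -> component parts.
SPLIT_DICT = {
    "们": "亻门",
    "票": "覀示",
}

def word_splitting(first_training_text):
    """Apply the text word-splitting rules to produce split-word texts."""
    split_texts = []
    for text in first_training_text:
        # Replace each character by its components where a rule exists.
        split_texts.append("".join(SPLIT_DICT.get(ch, ch) for ch in text))
    return split_texts
```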
In one embodiment of the present disclosure, the feature-enhanced training text comprises syllable-enhanced training text. The information enhancement processing unit includes a second feature enhancement subunit configured to: determine the syllable character string corresponding to the first training text, the syllable character string comprising a complete syllable string and a partial syllable string; and generate the syllable-enhanced training text based on the complete syllable string and the partial syllable string.
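The following sketch shows one way to produce the two syllable strings. Using the pypinyin library is an assumption (any per-character romanizer would do), and taking syllable initials as the "partial" string is one plausible reading of the embodiment.

```python
from pypinyin import lazy_pinyin  # assumption: pypinyin is available

def syllable_strings(text):
    """Return the complete and partial syllable strings for a text."""
    syllables = lazy_pinyin(text)               # e.g. "发票" -> ["fa", "piao"]
    full = "".join(syllables)                   # complete syllable string: "fapiao"
    partial = "".join(s[0] for s in syllables)  # partial string from initials: "fp"
    return full, partial
```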
In one embodiment of the present disclosure, the data-enhanced training text includes homophone replacement training text. The information enhancement processing unit includes a first data enhancement subunit configured to: determine the homophones to be replaced contained in the second training text and acquire the replacement homophones corresponding to the homophones to be replaced; and replace the homophones to be replaced with the replacement homophones to obtain the homophone replacement training text.
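A minimal sketch of this homophone replacement follows, assuming a small hypothetical homophone table; a production system would presumably draw candidates from a curated pronunciation dictionary.

```python
import random

# Hypothetical homophone table: character to be replaced -> replacement homophones.
HOMOPHONES = {"票": ["漂"], "发": ["法"]}

def replace_homophones(text):
    """Swap each replaceable character for one of its homophones."""
    return "".join(random.choice(HOMOPHONES[ch]) if ch in HOMOPHONES else ch
                   for ch in text)
```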
In one embodiment of the present disclosure, the data-enhanced training text includes equivalent-word replacement training text. The information enhancement processing unit includes a second data enhancement subunit configured to: determine the equivalent words to be replaced contained in the second training text; acquire the replacement equivalent words corresponding to them, the replacement equivalent words being characters that have the same meaning as, but a different surface form from, the equivalent words to be replaced; and replace the equivalent words to be replaced with the replacement equivalent words to obtain the equivalent-word replacement training text.
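A companion sketch for equivalent-word replacement, where the replacement keeps the meaning but changes the surface form (for example, a romanized or variant spelling); the mapping below is hypothetical.

```python
import random

# Hypothetical equivalent-word table: word -> same-meaning variant forms.
EQUIVALENTS = {"发票": ["fapiao", "發票"]}

def replace_equivalents(text):
    """Replace equivalent words with alternative expressions of the same meaning."""
    for word, variants in EQUIVALENTS.items():
        if word in text:
            text = text.replace(word, random.choice(variants))
    return text
```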
In an embodiment of the present disclosure, the enhanced text training set includes the feature-enhanced training text and the data-enhanced training text, and the model training apparatus 1100 for abnormal text classification further includes a model training module configured to: input the enhanced text training set into an initial text classification model to perform model training on the initial text classification model, obtaining a trained abnormal text classification model. The initial text classification model comprises a feature enhancement layer, a data enhancement layer and an overlay layer. The feature enhancement layer is used for extracting features of the feature-enhanced training text to generate a feature-enhanced text vector; the data enhancement layer is used for extracting features of the initial text training set and the data-enhanced training text to generate a data-enhanced text vector; and the overlay layer is used for generating an overlay vector based on the feature-enhanced text vector and the data-enhanced text vector.
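The following PyTorch sketch illustrates one way the three layers could be wired together. The encoder type (GRU), the dimensions, and concatenation as the overlay operation are all assumptions; the disclosure does not fix these choices.

```python
import torch
import torch.nn as nn

class AbnormalTextClassifier(nn.Module):
    """Hedged sketch of a model with feature, data, and overlay layers."""
    def __init__(self, vocab_size, embed_dim=128, num_classes=3):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        # Feature enhancement layer: encodes the feature-enhanced training text.
        self.feature_encoder = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Data enhancement layer: encodes the initial and data-enhanced text.
        self.data_encoder = nn.GRU(embed_dim, embed_dim, batch_first=True)
        # Classifier over the overlay vector (3 classes: normal/garbage/others).
        self.classifier = nn.Linear(2 * embed_dim, num_classes)

    def forward(self, feature_ids, data_ids):
        _, f = self.feature_encoder(self.embedding(feature_ids))
        _, d = self.data_encoder(self.embedding(data_ids))
        # Overlay layer: fuse the two text vectors by concatenation.
        overlay = torch.cat([f[-1], d[-1]], dim=-1)
        return self.classifier(overlay)
```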
Further, a text classification device of an exemplary embodiment of the present disclosure is explained with reference to fig. 12.
In fig. 12, the text classification apparatus 1200 may include: a text to be recognized acquisition module 1210, a text classification module 1220, and a result determination module 1230.
The text to be recognized acquiring module 1210 is configured to acquire a text to be recognized; the text classification module 1220 is configured to input the text to be recognized into a pre-trained abnormal text classification model, and perform text classification processing on the text to be recognized; the abnormal text classification model is obtained based on a model training method for abnormal text classification; and the result determining module 1230 is configured to determine a text classification result of the text to be recognized according to an output result of the abnormal text classification model.
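As a brief illustration of the flow these three modules implement, the following sketch runs a trained model (such as the AbnormalTextClassifier sketched earlier) on one text. The tokenize function, the label order, and feeding the same token ids to both model branches at inference time are all assumptions.

```python
import torch

LABELS = ("normal invoice", "garbage invoice", "others")  # assumed label order

def classify_text(model, tokenize, text):
    """Run a pre-trained abnormal text classification model on one text."""
    model.eval()
    ids = torch.tensor([tokenize(text)])   # shape: (1, seq_len)
    with torch.no_grad():
        logits = model(ids, ids)           # both branches see the same ids
    return LABELS[int(logits.argmax(dim=-1))]
```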
Since each functional module of the model training apparatus for abnormal text classification in the exemplary embodiments of the present disclosure corresponds to a step of the exemplary embodiments of the model training method for abnormal text classification and the text classification method described above, for details not disclosed in the apparatus embodiments of the present disclosure, please refer to the above method embodiments; details are not repeated here.
It should be noted that although several modules or units of the model training apparatus for abnormal text classification are mentioned in the above detailed description, such division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in a single module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
In a third aspect of embodiments of the present disclosure, a computer-readable storage medium is provided, on which a computer program is stored that, when executed by a processor, implements the model training method for abnormal text classification and the text classification method described above.
Exemplary Medium
Having described the apparatuses of the exemplary embodiments of the present disclosure, next, a storage medium of the exemplary embodiment of the present disclosure will be described with reference to fig. 13.
In some embodiments, aspects of the present disclosure may also be implemented as a medium having stored thereon program code for implementing steps in a model training method for abnormal text classification according to various exemplary embodiments of the present disclosure described in the above-mentioned "exemplary methods" section of this specification, when the program code is executed by a processor of a device.
For example, when the processor of the device executes the program code, the steps of the model training method for abnormal text classification as described in fig. 3 may be implemented, including: step S310, obtaining an initial text training set; step S320, performing information enhancement processing on the initial text training set to obtain an enhanced training text, the information enhancement processing including at least one of feature enhancement processing and data enhancement processing; and step S330, generating an enhanced text training set according to the initial text training set and the enhanced training text, the enhanced text training set being used for training an abnormal text classification model. Likewise, the steps of the text classification method as described in fig. 10 may be implemented, including: step S1010, acquiring a text to be recognized; step S1020, inputting the text to be recognized into a pre-trained abnormal text classification model and performing text classification processing on it, the abnormal text classification model being obtained by the model training method for abnormal text classification; and step S1030, determining the text classification result of the text to be recognized according to the output result of the abnormal text classification model.
Referring to fig. 13, a program product 1300 for implementing the above model training method for abnormal text classification or the above text classification method according to an embodiment of the present disclosure is described. The program product may employ a portable compact disc read-only memory (CD-ROM), include program code, and be run on a terminal device such as a personal computer. However, the program product of the present disclosure is not limited thereto.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. The readable signal medium may also be any readable medium other than a readable storage medium.
Program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" programming language. The program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN).
Exemplary computing device
Having described the model training method for abnormal text classification, the model training apparatus for abnormal text classification, and the storage medium according to the exemplary embodiments of the present disclosure, an electronic device according to the exemplary embodiments of the present disclosure will be described next with reference to fig. 14.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method, or program product. Accordingly, various aspects of the present disclosure may take the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
In some possible embodiments, an electronic device according to the present disclosure may include at least one processing unit and at least one storage unit, wherein the storage unit stores program code that, when executed by the processing unit, causes the processing unit to perform the steps in the model training method for abnormal text classification according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section above in this specification. For example, the processing unit may perform the steps of the model training method for abnormal text classification as shown in fig. 3, including: step S310, obtaining an initial text training set; step S320, performing information enhancement processing on the initial text training set to obtain an enhanced training text, the information enhancement processing including at least one of feature enhancement processing and data enhancement processing; and step S330, generating an enhanced text training set according to the initial text training set and the enhanced training text, the enhanced text training set being used for training an abnormal text classification model. The processing unit may also perform the steps of the text classification method as shown in fig. 10, including: step S1010, acquiring a text to be recognized; step S1020, inputting the text to be recognized into a pre-trained abnormal text classification model and performing text classification processing on it, the abnormal text classification model being obtained by the model training method for abnormal text classification; and step S1030, determining the text classification result of the text to be recognized according to the output result of the abnormal text classification model.
An electronic device 1400 according to an example embodiment of the disclosure is described below with reference to fig. 14. The electronic device 1400 shown in fig. 14 is only an example and should not bring any limitations to the functionality and scope of use of the embodiments of the present disclosure.
As shown in fig. 14, the electronic device 1400 is embodied in the form of a general purpose computing device. The components of the electronic device 1400 may include, but are not limited to: the at least one processing unit 1410, the at least one memory unit 1420, the bus 1430 that connects the various system components (including the memory unit 1420 and the processing unit 1410), and the display unit 1440.
Wherein the storage unit stores program code that is executable by the processing unit 1410, such that the processing unit 1410 performs steps according to various exemplary embodiments of the present disclosure described in the "exemplary methods" section above in this specification.
The storage unit 1420 may include readable media in the form of volatile memory units, such as a random access memory unit (RAM) 1421 and/or a cache memory unit 1422, and may further include a read only memory unit (ROM) 1423.
Storage unit 1420 may also include a program/utility 1424 having a set (at least one) of program modules 1425, such program modules 1425 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.
The bus 1430 may include a data bus, an address bus, and a control bus.
The electronic device 1400 may also communicate with one or more external devices 1470 (e.g., a keyboard, a pointing device, a Bluetooth device, etc.); such communication may occur through an input/output (I/O) interface 1450. Moreover, the electronic device 1400 may communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) via the network adapter 1460. As shown, the network adapter 1460 communicates with the other modules of the electronic device 1400 via the bus 1430. It should be appreciated that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 1400, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
It should be noted that although several units/modules or sub-units/modules of the model training apparatus for abnormal text classification are mentioned in the above detailed description, such division is merely exemplary and not mandatory. Indeed, according to embodiments of the present disclosure, the features and functionality of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by a plurality of units/modules.
Further, while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
While the spirit and principles of the present disclosure have been described with reference to several particular embodiments, it is to be understood that the present disclosure is not limited to the particular embodiments disclosed, and that the division into aspects is for convenience of presentation only and does not mean that features in different aspects cannot be combined to advantage. The disclosure is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims (10)

1. A model training method for abnormal text classification is characterized by comprising the following steps:
acquiring an initial text training set;
performing information enhancement processing on the initial text training set to obtain an enhanced training text; the information enhancement processing comprises at least one of feature enhancement processing and data enhancement processing;
generating an enhanced text training set according to the initial text training set and the enhanced training text; and the enhanced text training set is used for training an abnormal text classification model.
2. The method of claim 1, wherein the enhanced training text comprises feature-enhanced training text and data-enhanced training text; and performing the information enhancement processing on the initial text training set to obtain the enhanced training text comprises:
determining a first training text based on the initial text training set, the first training text being a training text composed of abnormal texts;
performing the feature enhancement processing on the first training text to obtain the feature-enhanced training text;
determining a second training text based on the initial text training set, the second training text being a training text with the opposite label to the first training text; and
performing the data enhancement processing on the second training text to obtain the data-enhanced training text.
3. The method of claim 2, wherein the feature-enhanced training text comprises word-splitting enhanced training text; and performing the feature enhancement processing on the first training text to obtain the feature-enhanced training text comprises:
acquiring a word-splitting dictionary, the word-splitting dictionary comprising a text word-splitting rule;
performing word-splitting processing on the first training text according to the text word-splitting rule to obtain a plurality of split-word texts; and
adding the obtained plurality of split-word texts to the word-splitting enhanced training text.
4. The method of claim 2, wherein the data-enhanced training text comprises homophone replacement training text; and performing the data enhancement processing on the second training text to obtain the data-enhanced training text comprises:
determining homophones to be replaced contained in the second training text, and acquiring replacement homophones corresponding to the homophones to be replaced; and
replacing the homophones to be replaced with the replacement homophones to obtain the homophone replacement training text.
5. The method of claim 1, wherein the enhanced text training set comprises feature-enhanced training text and data-enhanced training text, the method further comprising:
inputting the enhanced text training set into an initial text classification model to perform model training on the initial text classification model to obtain a trained abnormal text classification model;
wherein the initial text classification model comprises a feature enhancement layer, a data enhancement layer and an overlay layer;
the feature enhancement layer is used for extracting features of the feature-enhanced training text to generate a feature-enhanced text vector;
the data enhancement layer is used for extracting features of the initial text training set and the data-enhanced training text to generate a data-enhanced text vector; and
the overlay layer is used for generating an overlay vector based on the feature-enhanced text vector and the data-enhanced text vector.
6. A method of text classification, comprising:
acquiring a text to be identified;
inputting the text to be recognized into a pre-trained abnormal text classification model, and performing text classification processing on the text to be recognized; the abnormal text classification model is obtained based on the model training method for abnormal text classification as claimed in any one of claims 1-5;
and determining the text classification result of the text to be recognized according to the output result of the abnormal text classification model.
7. A model training apparatus for abnormal text classification, comprising:
the initial training set acquisition module is used for acquiring an initial text training set;
the information enhancement processing module is used for carrying out information enhancement processing on the initial text training set to obtain an enhanced training text; the information enhancement processing comprises at least one of feature enhancement processing and data enhancement processing;
the enhanced training set generating module is used for generating an enhanced text training set according to the initial text training set and the enhanced training text; and the enhanced text training set is used for training an abnormal text classification model.
8. A text classification apparatus, comprising:
the text to be recognized acquisition module is used for acquiring a text to be recognized;
the text classification module is used for inputting the text to be recognized to a pre-trained abnormal text classification model and performing text classification processing on the text to be recognized; the abnormal text classification model is obtained based on the model training method for abnormal text classification as claimed in any one of claims 1-5;
and the result determining module is used for determining the text classification result of the text to be recognized according to the output result of the abnormal text classification model.
9. An electronic device, comprising:
a processor; and
a memory having stored thereon computer-readable instructions which, when executed by the processor, implement the model training method for abnormal text classification of any one of claims 1-5 and the text classification method of claim 6.
10. A computer-readable medium, on which a computer program is stored which, when executed by a processor, implements the model training method for abnormal text classification of any one of claims 1-5 and the text classification method of claim 6.

Legal Events

| Code | Title |
| ---- | ----- |
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |