Risk text recognition method and device
Technical field
This specification relates to the Internet field, and more particularly to a risk text recognition method and device.
Background
With the rise of the mobile internet, products such as e-commerce, community platforms, short video, and live streaming have flourished, and their large user bases contribute a great deal of high-quality original content. At the same time, black- and gray-market groups have seized the opportunity to produce massive amounts of spam, such as junk advertisements, explicit comments, and fraud information, which seriously harms internet products and their users.
The prior art usually counters spam text by generating text-based keyword rules: from patterns that occur frequently in black text (known risk text), risk identification rules are summarized manually or mined automatically by machine. For example, the simultaneous occurrence of "flower" and "arbitrage" is treated as a risk identification rule, and the rule is then used to identify text.
However, the widespread use of emoticons has given spam text a new direction in which to evolve: to evade traditional anti-spam models, many violating users mix emoticons into otherwise normal text. Traditional keyword recognition rules do not take these special characters into account, so deliberately modifying a risk text by replacing its ordinary risk wording with emoticons lowers the probability that it is identified by a traditional keyword-based anti-spam model. At present there is no satisfactory method for handling risk text that contains emoticons.
Summary of the invention
In view of the above technical problem, the embodiments of this specification provide a risk text recognition method and device. The technical solution is as follows:
According to a first aspect of the embodiments of this specification, a risk text recognition method is provided, the method comprising:
calculating emoticon features in text according to a preset risk algorithm, and generating, according to the emoticon features, a risk identification rule that includes a risk emoticon;
obtaining a text to be identified and matching the risk identification rule against the text to be identified, and if the match succeeds, determining the text to be identified to be a text that contains risk.
According to a second aspect of the embodiments of this specification, a risk text recognition device is provided, the device comprising:
a rule generation module, configured to calculate emoticon features in text according to a preset risk algorithm and to generate, according to the emoticon features, a risk identification rule that includes a risk emoticon;
a text identification module, configured to obtain a text to be identified and match the risk identification rule against the text to be identified, and if the match succeeds, to determine the text to be identified to be a text that contains risk.
According to a third aspect of the embodiments of this specification, a computer device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements a risk text recognition method, the method comprising:
calculating emoticon features in text according to a preset risk algorithm, and generating, according to the emoticon features, a risk identification rule that includes a risk emoticon;
obtaining a text to be identified and matching the risk identification rule against the text to be identified, and if the match succeeds, determining the text to be identified to be a text that contains risk.
The technical solution provided by the embodiments of this specification exploits the fact that the same emoticon occurs with different frequencies in black text and in white text: emoticons whose frequency of occurrence differs substantially between black and white text are extracted and then combined into risk identification rules that include emoticons, making up for the lack of emoticon coverage in conventional keyword recognition rules.
It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit the embodiments of this specification.
In addition, no single embodiment of this specification needs to achieve all of the above effects.
Brief description of the drawings
To describe the technical solutions in the embodiments of this specification or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some of the embodiments recorded in this specification, and a person of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a schematic diagram of emoticons according to an exemplary embodiment of this specification;
Fig. 2 is a flow chart of a risk text recognition method according to an exemplary embodiment of this specification;
Fig. 3 is a flow chart of a risk identification rule generation method according to an exemplary embodiment of this specification;
Fig. 4 is another flow chart of a risk identification rule generation method according to an exemplary embodiment of this specification;
Fig. 5 is a further flow chart of a risk identification rule generation method according to an exemplary embodiment of this specification;
Fig. 6 is a schematic diagram of a risk text recognition device according to an exemplary embodiment of this specification;
Fig. 7 is a schematic structural diagram of a computer device according to an exemplary embodiment of this specification.
Specific embodiments
Exemplary embodiments are described in detail here, and examples of them are illustrated in the drawings. Where the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification; rather, they are merely examples of devices and methods consistent with some aspects of this specification as detailed in the appended claims.
The terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit this specification. The singular forms "a", "said", and "the" used in this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in this specification to describe various pieces of information, the information should not be limited by these terms, which serve only to distinguish information of the same type. For example, without departing from the scope of this specification, first information may also be called second information, and similarly, second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
With the rise of the mobile internet, products such as e-commerce, community platforms, short video, and live streaming have flourished, and their large user bases contribute a great deal of high-quality original content. At the same time, black- and gray-market groups have seized the opportunity to produce massive amounts of junk advertisements, explicit comments, and fraud information, which seriously harms internet products and their users.
The prior art usually counters spam text by generating text-based keyword rules: from patterns that occur frequently in black text, risk identification rules are summarized manually or mined automatically by machine. For example, the simultaneous occurrence of "flower" and "arbitrage" is treated as a risk identification rule, and the rule is then used to identify text.
However, the widespread use of emoticons has given spam text a new direction in which to evolve: to evade traditional anti-spam models, many violating users mix emoticons, such as emoji or other emoticon sets, into otherwise normal text (see Fig. 1). Traditional keyword recognition rules do not take these special characters into account, so deliberately modifying a risk text by replacing its ordinary risk wording with emoticons lowers the probability that it is identified by a traditional keyword-based anti-spam model. At present there is no satisfactory method for handling risk text that contains emoticons.
In view of the above problems, the embodiments of this specification provide a risk text recognition method and a risk text recognition device for executing the method. The method is mainly applied to text that users publish on Internet community platforms. Specifically, community platforms may include online communication platforms such as BBS/forums, discussion boards, bulletin boards, personal knowledge publishing, and group discussions.
The risk text recognition method of this embodiment is described in detail below. As shown in Fig. 2, the method may include the following steps:
S201: calculate emoticon features in text according to a preset risk algorithm, and generate, according to the emoticon features, a risk identification rule that includes a risk emoticon;
S202: obtain a text to be identified and match the risk identification rule against the text to be identified; if the match succeeds, determine the text to be identified to be a text that contains risk.
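As an illustration only, the matching in step S202 might be sketched as follows in Python. The specification does not fix how emoticons are encoded or how a rule is stored; the bracketed-token encoding (for example "[money]"), the representation of a rule as a tuple of items that must all occur, and the names EMOTICON_RE, rule_matches, and is_risk_text are assumptions made for this sketch.

```python
import re

# Assumption: emoticons appear inline as bracketed tokens such as "[money]".
EMOTICON_RE = re.compile(r"\[[^\[\]]+\]")

def rule_matches(rule, text):
    """Return True if every item of the rule occurs in the text.

    Bracketed items are treated as emoticons and compared against the
    emoticon tokens found in the text; other items are plain keywords.
    """
    emoticons = set(EMOTICON_RE.findall(text))
    for item in rule:
        if EMOTICON_RE.fullmatch(item):
            if item not in emoticons:
                return False
        elif item not in text:
            return False
    return True

def is_risk_text(text, rules):
    """S202: the text is treated as risky if any rule matches it."""
    return any(rule_matches(rule, text) for rule in rules)

# Hypothetical rules in the style of the examples in this specification.
rules = [("[seals face]", "[fresh flower]"), ("borrowing", "[money]")]
print(is_risk_text("borrowing made easy [money][money]", rules))
print(is_risk_text("lovely weather [fresh flower]", rules))
```

In this sketch a rule fires only when all of its items co-occur, which mirrors the "occur simultaneously" wording used for keyword rules in the background section.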
Specifically, referring to Fig. 3, the method of determining at least one emoticon feature and generating, according to the emoticon feature, a risk identification rule that includes a risk emoticon may include the following steps:
S301: obtain a black text set, the black text set being a sample set containing multiple risk texts;
S302: extract emoticon presence features from the black text set, and generate, according to the emoticon presence features, a risk identification rule that includes a risk emoticon.
For example, feature recognition is performed over the multiple risk texts in the black text set. If the emoticon feature "two [money] emoticons plus one [phone] emoticon" is found to occur in the black text set at a frequency clearly above the average level, that emoticon combination is determined to be an emoticon presence feature of the black text set. Alternatively, when the share of the emoticon [money] in a text exceeds some threshold, the emoticon [money] with a share above that threshold is determined to be an emoticon presence feature of the black text set.
The emoticon feature extraction methods are not limited to the above examples; further feature extraction rules can be set according to actual needs.
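The two example features above can be sketched as follows. The specification does not define how an emoticon's share of a text is measured; this sketch assumes that each emoticon token and each remaining non-whitespace character counts as one unit. The bracketed-token encoding and the names emoticon_share and has_combination are likewise assumptions for illustration.

```python
import re

# Assumption: emoticons appear inline as bracketed tokens such as "[money]".
EMOTICON_RE = re.compile(r"\[[^\[\]]+\]")

def emoticon_share(text, emoticon):
    """Share of the text's units taken up by one emoticon.

    Assumed definition: every emoticon token is one unit and every other
    non-whitespace character is one unit.
    """
    tokens = EMOTICON_RE.findall(text)
    remainder = EMOTICON_RE.sub("", text)
    other_chars = sum(1 for ch in remainder if not ch.isspace())
    total = len(tokens) + other_chars
    return tokens.count(emoticon) / total if total else 0.0

def has_combination(text, required):
    """True if the text contains at least the required count of each emoticon."""
    tokens = EMOTICON_RE.findall(text)
    return all(tokens.count(emo) >= n for emo, n in required.items())

text = "[money][money] call [phone]"
print(emoticon_share(text, "[money]"))
print(has_combination(text, {"[money]": 2, "[phone]": 1}))
```

Counting how often has_combination holds across the black text set, or how often emoticon_share exceeds a chosen threshold, would then give the frequency against which "clearly above the average level" is judged.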
In addition to extracting emoticon presence features from the black text set and then generating risk identification rules, this application also provides a method of generating risk identification rules that include risk emoticons from both a black text set and a white text set. Referring to Fig. 4, the method includes the following steps:
S401: obtain a black text set and a white text set, and calculate the difference between the frequencies at which each emoticon occurs in the black text set and in the white text set, where the black text set is a sample set containing multiple risk texts and the white text set is a sample set containing multiple risk-free texts;
The black text set and the white text set are sets of black and white text samples prepared in advance, where a black text is a text confirmed to be risky because it contains spam content, and a white text is a text confirmed to be safe because it contains no spam content. It should be noted that, to keep the statistics as accurate as possible, the black text set and the white text set are generally made similar in number of texts/size.
The emoticons contained in the black text set and in the white text set are extracted separately, and the frequency at which each emoticon occurs in black text and in white text is counted, as shown in Table 1.
Emoticon | Frequency of occurrence in black text set | Frequency of occurrence in white text set
Emoticon [seals face] | 0.05375 | 0.0376
Emoticon [fresh flower] | 0.04678 | 0.0375
Emoticon [happiness] | 0.04446 | 0.03392
Emoticon [phone] | 0.02462 | 0.02442
…… | …… | ……
Table 1
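A minimal sketch of how such per-emoticon frequencies could be counted is given below. The specification does not state the exact definition of "frequency of occurrence"; this sketch assumes it is the fraction of texts in a set that contain the emoticon at least once. The bracketed-token encoding, the sample texts, and the name emoticon_frequencies are illustrative assumptions.

```python
import re
from collections import Counter

# Assumption: emoticons appear inline as bracketed tokens such as "[money]".
EMOTICON_RE = re.compile(r"\[[^\[\]]+\]")

def emoticon_frequencies(texts):
    """Fraction of texts in which each emoticon appears at least once."""
    counts = Counter()
    for text in texts:
        # A set is used so that repeats inside one text count only once.
        counts.update(set(EMOTICON_RE.findall(text)))
    n = len(texts)
    return {emo: c / n for emo, c in counts.items()} if n else {}

# Hypothetical black text set, for illustration only.
black_texts = [
    "[money][money] call [phone]",
    "easy loan [money]",
    "hello",
    "promo [phone]",
]
freq = emoticon_frequencies(black_texts)
print(freq["[money]"], freq["[phone]"])
```

Running the same function over the white text set gives the second column of a table like Table 1.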
After the frequencies at which each emoticon occurs in black text and in white text have been obtained, the difference between the two frequencies is calculated for each emoticon, as shown in Table 2:
Emoticon | Frequency difference between black and white text sets
Emoticon [seals face] | 0.01615
Emoticon [fresh flower] | 0.00928
Emoticon [happiness] | 0.01054
Emoticon [phone] | 0.0002
…… | ……
Table 2
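The difference calculation is a per-emoticon subtraction of the white-set frequency from the black-set frequency. The sketch below applies it to the four frequencies listed in Table 1; the function name frequency_differences is an assumption for illustration.

```python
# Frequencies copied from Table 1.
black_freq = {
    "[seals face]": 0.05375,
    "[fresh flower]": 0.04678,
    "[happiness]": 0.04446,
    "[phone]": 0.02462,
}
white_freq = {
    "[seals face]": 0.0376,
    "[fresh flower]": 0.0375,
    "[happiness]": 0.03392,
    "[phone]": 0.02442,
}

def frequency_differences(black, white):
    """Black-set frequency minus white-set frequency for every emoticon seen."""
    emoticons = set(black) | set(white)
    return {e: black.get(e, 0.0) - white.get(e, 0.0) for e in emoticons}

diffs = frequency_differences(black_freq, white_freq)
for emo, d in sorted(diffs.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{emo}: {d:.5f}")
```

An emoticon that appears in only one of the two sets is given a frequency of 0 in the other, which is a choice made for this sketch rather than something the specification prescribes.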
It can be understood that when an emoticon occurs more frequently in black text than in white text, and its frequency difference between black and white text is clearly larger than that of other emoticons, the emoticon is very likely being used by black- and gray-market groups for text insertion or replacement in spam content.
S402: take the frequency differences corresponding to the different emoticons as a frequency difference set, determine the larger frequency differences in the set to be qualifying frequency differences, and determine the emoticons corresponding to the qualifying frequency differences to be risk emoticons;
Table 2 above lists each emoticon and its frequency difference, which together form the frequency difference set. The set is screened, and the larger frequency differences are determined to be qualifying frequency differences.
There are many ways to determine the qualifying frequency differences, for example:
a) sort the frequency differences in the set from high to low by value, and determine the top N frequency differences to be qualifying;
b) sort the frequency differences in the set from high to low by value, and determine the top N% of frequency differences to be qualifying;
c) determine the frequency differences in the set that are higher than a preset value to be qualifying;
d) sort the frequency differences in the set from high to low by value, take the top N% of frequency differences, and among those determine the ones higher than a preset value to be qualifying.
It should be noted that the above ways of determining qualifying frequency differences are only examples and do not limit this specification; developers can screen the larger frequency differences from the set in other ways according to the actual situation.
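The four screening options a) to d) can be sketched as small functions over a mapping from emoticon to frequency difference. The function names, the rounding-up of N% to a whole count, and the sample values (derived from the Table 1 frequencies) are assumptions made for illustration.

```python
import math

def top_n(diffs, n):
    """a) Emoticons with the N largest frequency differences."""
    ranked = sorted(diffs, key=diffs.get, reverse=True)
    return ranked[:n]

def top_percent(diffs, percent):
    """b) Emoticons in the top N% by frequency difference (count rounded up)."""
    return top_n(diffs, math.ceil(len(diffs) * percent / 100))

def above_threshold(diffs, threshold):
    """c) Emoticons whose frequency difference exceeds a preset value."""
    return [e for e in top_n(diffs, len(diffs)) if diffs[e] > threshold]

def top_percent_above_threshold(diffs, percent, threshold):
    """d) Take the top N% first, then keep only those above the preset value."""
    return [e for e in top_percent(diffs, percent) if diffs[e] > threshold]

# Illustrative differences, derived from the Table 1 frequencies.
diffs = {
    "[seals face]": 0.01615,
    "[fresh flower]": 0.00928,
    "[happiness]": 0.01054,
    "[phone]": 0.0002,
}
print(top_n(diffs, 2))
print(top_percent(diffs, 50))
print(above_threshold(diffs, 0.01))
print(top_percent_above_threshold(diffs, 75, 0.01))
```

Option d) is the strictest of the four because an emoticon must pass both the rank cut and the absolute cut, which can help when the set contains only a few emoticons with meaningful differences.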
S403: arrange and combine the determined risk emoticons, and generate risk identification rules that include risk emoticons according to the resulting combinations.
The different risk emoticons are arranged and combined, and the resulting combinations are used as risk identification rules. For example, suppose the determined risk emoticons are [fresh flower], [happiness], and [phone]. In text posted by users, the simultaneous occurrence of [fresh flower] and [happiness] in a text is treated as one risk combination, the simultaneous occurrence of [happiness] and [phone] as another risk combination, and so on; arranging and combining the different risk emoticons yields a list of possible risk combinations.
In practical applications, risk emoticons can also be combined with existing risk keywords. For example, the simultaneous occurrence in a text of the risk keyword "borrowing" and the risk emoticon [money] is treated as one risk combination, and the simultaneous occurrence of the risk keyword "regular" and the risk emoticon [phone] as another. Risk keywords can be identified with existing risk keyword recognition techniques, which are not described here.
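Enumerating the risk combinations can be sketched with the Python standard library. Because a combination only requires its items to occur in the same text, order is irrelevant, so this sketch uses unordered pairs; the pair size of two, the pairing of every keyword with every risk emoticon, and the name candidate_rules are assumptions for illustration.

```python
from itertools import combinations, product

def candidate_rules(risk_emoticons, risk_keywords=(), size=2):
    """List candidate risk combinations.

    Emoticon-only rules are unordered groups of `size` risk emoticons;
    keyword rules pair each risk keyword with each risk emoticon.
    """
    rules = [tuple(group) for group in combinations(risk_emoticons, size)]
    rules += [(kw, emo) for kw, emo in product(risk_keywords, risk_emoticons)]
    return rules

rules = candidate_rules(
    ["[fresh flower]", "[happiness]", "[phone]"],
    risk_keywords=["borrowing"],
)
for rule in rules:
    print(" ^ ".join(rule))
```

With three risk emoticons and one keyword this yields three emoticon pairs and three keyword-emoticon pairs. The number of combinations grows quickly with the number of risk emoticons, so some later screening of the candidates is needed in any case.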
After multiple risk combinations have been determined, they are treated as candidate risk identification rules, and the final risk identification rules are determined by verifying and screening the candidates.
Typically, the verification and screening is done by hit verification of the candidate risk identification rules: on a verification data set containing a large number of black and white texts, each candidate rule is applied in turn, the numbers of black and white texts it hits are determined, and the hit accuracy of the candidate rule is then calculated. Table 3 shows statistics obtained after hit verification of candidate risk identification rules on the verification data set.
Candidate rule | Texts hit | Total texts | Black texts hit | Hit ratio
[seals face] ^ [fresh flower] | 198 | 9999 | 190 | 96%
[phone] ^ [happiness] | 231 | 9999 | 150 | 65%
"borrowing" ^ [money] | 330 | 9999 | 296 | 90%
…… | …… | …… | …… | ……
Table 3
As shown above, the hit ratio is the ratio of the number of black texts hit by a candidate rule to the total number of texts it hits. It can be understood that the larger this ratio, the higher the accuracy of the candidate rule and the greater the probability that it identifies black text in practical applications. Developers can set a threshold according to the actual situation and determine the candidate risk identification rules whose hit ratio is above the threshold to be the risk identification rules finally put into use.
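Hit verification can be sketched as follows: each candidate rule is run over a labelled verification set, the hit ratio is computed as black texts hit divided by all texts hit, and rules above a threshold are kept. The simple substring matching, the tiny verification set, the threshold of 0.6, and the names hit_stats and select_rules are assumptions for illustration.

```python
def rule_matches(rule, text):
    """Assumed matching: every item of the rule occurs in the text."""
    return all(item in text for item in rule)

def hit_stats(rule, labelled_texts):
    """Return (texts hit, black texts hit, hit ratio) for one candidate rule."""
    hit_labels = [is_black for text, is_black in labelled_texts
                  if rule_matches(rule, text)]
    black_hit = sum(hit_labels)
    ratio = black_hit / len(hit_labels) if hit_labels else 0.0
    return len(hit_labels), black_hit, ratio

def select_rules(candidates, labelled_texts, threshold):
    """Keep the candidate rules whose hit ratio is above the threshold."""
    return [rule for rule in candidates
            if hit_stats(rule, labelled_texts)[2] > threshold]

# Hypothetical verification set: (text, is_black) pairs.
validation = [
    ("borrowing fast [money]", True),
    ("borrowing now [money][money]", True),
    ("borrowing a book cost me [money]", False),
    ("call me [phone] [happiness]", True),
    ("[phone] [happiness] see you soon", False),
]
candidates = [("borrowing", "[money]"), ("[phone]", "[happiness]")]
print(select_rules(candidates, validation, threshold=0.6))
```

Applied to the Table 3 figures, the same ratio gives 190/198 for the first rule and 150/231 for the second, consistent with the 96% and 65% listed there.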
The embodiments of this specification also provide a more specific risk rule generation method. As shown in Fig. 5, the method may include the following steps:
S501: count the frequency of occurrence of each emoticon in the black text set;
S502: count the frequency of occurrence of each emoticon in the white text set;
S503: calculate, for each emoticon, the difference between its frequency of occurrence in black text and its frequency of occurrence in white text;
S504: sort the frequency differences from high to low by value, select the frequency differences whose rank satisfies a first preset value or whose value satisfies a second preset value, and determine the emoticons corresponding to the frequency differences that satisfy the preset condition to be risk emoticons;
S505: arrange and combine the different risk emoticons and/or different risk keywords, and generate risk identification rules that include risk emoticons according to the resulting combinations.
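Steps S501 to S505 can be chained into one function, sketched below. This is not a definitive implementation: it assumes emoticons are bracketed tokens, that frequency means the fraction of texts containing the emoticon, that the first preset value is a rank cut (top_n) and the second a minimum difference (min_diff), and that rules are pairs. All names and sample texts are hypothetical.

```python
import re
from collections import Counter
from itertools import combinations

# Assumption: emoticons appear inline as bracketed tokens such as "[money]".
EMOTICON_RE = re.compile(r"\[[^\[\]]+\]")

def doc_frequency(texts):
    """Fraction of texts containing each emoticon at least once."""
    counts = Counter()
    for text in texts:
        counts.update(set(EMOTICON_RE.findall(text)))
    return {e: c / len(texts) for e, c in counts.items()} if texts else {}

def generate_rules(black_texts, white_texts, top_n, min_diff, keywords=()):
    black = doc_frequency(black_texts)                        # S501
    white = doc_frequency(white_texts)                        # S502
    diffs = {e: black.get(e, 0.0) - white.get(e, 0.0)
             for e in set(black) | set(white)}                # S503
    ranked = sorted(diffs, key=diffs.get, reverse=True)       # S504
    risky = [e for rank, e in enumerate(ranked)
             if rank < top_n or diffs[e] > min_diff]
    pool = risky + list(keywords)                             # S505
    # Keep only pairs that contain at least one risk emoticon.
    return [pair for pair in combinations(pool, 2)
            if any(item in risky for item in pair)]

black = ["[money] [phone] loan", "[money] cash", "[phone] [money]"]
white = ["hi [happiness]", "[phone] call mom", "lunch?"]
rules = generate_rules(black, white, top_n=2, min_diff=0.5,
                       keywords=("borrowing",))
print(rules)
```

The candidate rules returned here would still go through the hit verification described above before being put into use.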
Corresponding to the above method embodiments, the embodiments of this specification also provide a risk text recognition device. As shown in Fig. 6, the device may include a rule generation module 610 and a text identification module 620.
Rule generation module 610: configured to calculate emoticon features in text according to a preset risk algorithm and to generate, according to the emoticon features, a risk identification rule that includes a risk emoticon;
Text identification module 620: configured to obtain a text to be identified and match the risk identification rule against the text to be identified, and if the match succeeds, to determine the text to be identified to be a text that contains risk.
The embodiments of this specification also provide a computer device, which includes at least a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the aforementioned risk text recognition method when executing the program, the method including at least:
calculating emoticon features in text according to a preset risk algorithm, and generating, according to the emoticon features, a risk identification rule that includes a risk emoticon;
obtaining a text to be identified and matching the risk identification rule against the text to be identified, and if the match succeeds, determining the text to be identified to be a text that contains risk.
Fig. 7 shows a more specific schematic diagram of the hardware structure of a computing device provided by the embodiments of this specification. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050, where the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040 are communicatively connected to one another inside the device through the bus 1050.
The processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is used to execute the relevant programs so as to implement the technical solutions provided by the embodiments of this specification.
The memory 1020 may be implemented as ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 may store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 1020 and is called and executed by the processor 1010.
The input/output interface 1030 is used to connect an input/output module to realize information input and output. The input/output module may be configured in the device as a component (not shown) or may be externally connected to the device to provide the corresponding functions. Input devices may include a keyboard, mouse, touch screen, microphone, various sensors, and the like; output devices may include a display, loudspeaker, vibrator, indicator light, and the like.
The communication interface 1040 is used to connect a communication module (not shown) to realize communication between this device and other devices. The communication module may communicate in a wired manner (for example USB or network cable) or in a wireless manner (for example mobile network, WIFI, or Bluetooth).
The bus 1050 includes a path that transfers information between the components of the device (for example the processor 1010, the memory 1020, the input/output interface 1030, and the communication interface 1040).
It should be noted that although only the processor 1010, the memory 1020, the input/output interface 1030, the communication interface 1040, and the bus 1050 are shown for the above device, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the above device may also include only the components necessary to implement the solutions of the embodiments of this specification, without including all the components shown in the figure.
The embodiments of this specification also provide a computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the aforementioned risk text recognition method, the method including at least:
calculating emoticon features in text according to a preset risk algorithm, and generating, according to the emoticon features, a risk identification rule that includes a risk emoticon;
obtaining a text to be identified and matching the risk identification rule against the text to be identified, and if the match succeeds, determining the text to be identified to be a text that contains risk.
Computer-readable media include permanent and non-permanent, removable and non-removable media, which can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
Since the device embodiments essentially correspond to the method embodiments, reference may be made to the description of the method embodiments for the relevant parts. The device embodiments described above are merely illustrative: the units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solutions of this specification. A person of ordinary skill in the art can understand and implement them without creative effort.
As can be seen from the above description of the implementations, those skilled in the art can clearly understand that the embodiments of this specification can be implemented by means of software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of this specification, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as ROM/RAM, a magnetic disk, or an optical disc, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments, or in certain parts of the embodiments, of this specification.
The systems, devices, modules, or units described in the above embodiments may be implemented by a computer chip or entity, or by a product with a certain function. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media player, navigation device, e-mail device, game console, tablet computer, wearable device, or a combination of any of these devices.
The embodiments in this specification are described in a progressive manner; the same or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly, and reference may be made to the description of the method embodiments for the relevant parts. The device embodiments described above are merely illustrative: the modules described as separate components may or may not be physically separate, and when the solutions of the embodiments of this specification are implemented, the functions of the modules may be implemented in one or more pieces of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the purpose of the solution of an embodiment. A person of ordinary skill in the art can understand and implement them without creative effort.
The above are only specific implementations of the embodiments of this specification. It should be pointed out that a person of ordinary skill in the art can make several improvements and modifications without departing from the principles of the embodiments of this specification, and these improvements and modifications should also be regarded as falling within the protection scope of the embodiments of this specification.