
CN109033224A - Risk text recognition method and device - Google Patents

Risk text recognition method and device

Info

Publication number
CN109033224A
Authority
CN
China
Prior art keywords
risk
text
frequency
rule
difference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810713229.8A
Other languages
Chinese (zh)
Other versions
CN109033224B (en
Inventor
周书恒
祝慧佳
赵智源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd
Priority to CN201810713229.8A
Publication of CN109033224A
Application granted
Publication of CN109033224B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

This application provides a risk text recognition method and device. Emoticon features in text are first computed according to a preset risk algorithm, and risk identification rules containing risk emoticons are generated from those features. A text to be identified is then obtained and the risk identification rules are matched against it; if a rule matches, the text is determined to be a text containing risk. This compensates for the inability of conventional keyword recognition rules to handle emoticons.

Description

Risk text recognition method and device
Technical field
This specification relates to the Internet field, and in particular to a risk text recognition method and device.
Background
With the rise of the mobile Internet, products such as e-commerce, community platforms, short video, and live streaming have flourished, and their huge user bases contribute a large amount of high-quality original content. At the same time, gray- and black-market groups have seized the opportunity and produced massive amounts of spam advertising, explicit comments, fraud messages, and other junk content that seriously harm Internet products and their users.
The prior-art approach to filtering spam text is usually to generate text-based keyword rules: from patterns that occur frequently in black (risky) texts, risk identification rules are summarized manually or mined automatically. For example, the co-occurrence of "flower" and "arbitrage" may be treated as a risk identification rule, and such rules are then used to identify texts.
However, the widespread use of emoticons has given spam text a new direction in which to evolve. To evade traditional anti-spam models, many rule-violating users mix emoticons into otherwise normal text. Traditional keyword recognition rules do not take these special characters into account, so a risk text that has been deliberately altered, with emoticons replacing the usual risk words, is less likely to be caught by a traditional keyword-based anti-spam model. There is currently no good method for handling risk texts that contain emoticons.
Summary of the invention
In view of the above technical problems, the embodiments of this specification provide a risk text recognition method and device. The technical solution is as follows:
According to a first aspect of the embodiments of this specification, a risk text recognition method is provided, the method comprising:
computing emoticon features in text according to a preset risk algorithm, and generating, from the emoticon features, a risk identification rule containing a risk emoticon;
obtaining a text to be identified and matching the risk identification rule against it, and, if the match succeeds, determining the text to be identified to be a text containing risk.
According to a second aspect of the embodiments of this specification, a risk text recognition device is provided, the device comprising:
a rule generation module, configured to compute emoticon features in text according to a preset risk algorithm and to generate, from the emoticon features, a risk identification rule containing a risk emoticon;
a text identification module, configured to obtain a text to be identified and match the risk identification rule against it, and, if the match succeeds, to determine the text to be identified to be a text containing risk.
According to a third aspect of the embodiments of this specification, a computer device is provided, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements a risk text recognition method when executing the program, the method comprising:
computing emoticon features in text according to a preset risk algorithm, and generating, from the emoticon features, a risk identification rule containing a risk emoticon;
obtaining a text to be identified and matching the risk identification rule against it, and, if the match succeeds, determining the text to be identified to be a text containing risk.
The technical solution provided by the embodiments of this specification exploits the fact that the same emoticon occurs with different frequencies in black and white texts: it extracts the emoticons whose frequency of occurrence differs most between black and white texts and combines them into risk identification rules containing emoticons, thereby compensating for the inability of conventional keyword recognition rules to handle emoticons.
It should be understood that the above general description and the following detailed description are only exemplary and explanatory and do not limit the embodiments of this specification.
In addition, no single embodiment of this specification needs to achieve all of the above effects.
Brief description of the drawings
To explain the embodiments of this specification or the technical solutions in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some of the embodiments recorded in this specification; those of ordinary skill in the art can derive other drawings from them.
Fig. 1 is a schematic diagram of emoticons according to an exemplary embodiment of this specification;
Fig. 2 is a flowchart of a risk text recognition method according to an exemplary embodiment of this specification;
Fig. 3 is a flowchart of a risk identification rule generation method according to an exemplary embodiment of this specification;
Fig. 4 is another flowchart of a risk identification rule generation method according to an exemplary embodiment of this specification;
Fig. 5 is another flowchart of a risk identification rule generation method according to an exemplary embodiment of this specification;
Fig. 6 is a schematic diagram of a risk text recognition device according to an exemplary embodiment of this specification;
Fig. 7 is a schematic structural diagram of a computer device according to an exemplary embodiment of this specification.
Detailed description of the embodiments
Exemplary embodiments are described in detail here, with examples illustrated in the accompanying drawings. When the following description refers to the drawings, the same numerals in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with this specification. Rather, they are merely examples of devices and methods consistent with some aspects of this specification, as detailed in the appended claims.
The terms used in this specification are for the purpose of describing particular embodiments only and are not intended to limit this specification. The singular forms "a", "said", and "the" used in this specification and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, and so on may be used in this specification to describe various information, the information should not be limited by these terms. The terms are only used to distinguish information of the same type from one another. For example, without departing from the scope of this specification, first information may also be called second information, and similarly second information may also be called first information. Depending on the context, the word "if" as used herein may be interpreted as "when", "upon", or "in response to determining".
With the rise of the mobile Internet, products such as e-commerce, community platforms, short video, and live streaming have flourished, and their huge user bases contribute a large amount of high-quality original content. At the same time, gray- and black-market groups have seized the opportunity and produced massive amounts of spam advertising, explicit comments, and fraud messages that seriously harm Internet products and their users.
The prior-art approach to filtering spam text is usually to generate text-based keyword rules: from patterns that occur frequently in black (risky) texts, risk identification rules are summarized manually or mined automatically. For example, the co-occurrence of "flower" and "arbitrage" may be treated as a risk identification rule, and such rules are then used to identify texts.
However, the widespread use of emoticons has given spam text a new direction in which to evolve. To evade traditional anti-spam models, many rule-violating users mix emoticons, such as emoji or other emoticons (see Fig. 1), into otherwise normal text. Traditional keyword recognition rules do not take these special characters into account, so a risk text that has been deliberately altered, with emoticons replacing the usual risk words, is less likely to be caught by a traditional keyword-based anti-spam model. There is currently no good method for handling risk texts that contain emoticons.
In view of the above problems, the embodiments of this specification provide a risk text recognition method and a risk text recognition device for executing the method. The method is mainly applied to text that users publish on Internet community platforms. Specifically, community platforms may include online communication platforms such as BBS/forums, post bars, bulletin boards, personal knowledge publishing, and group discussions.
The risk text recognition method of this embodiment is described in detail below. Referring to Fig. 2, the method may include the following steps:
S201: compute emoticon features in text according to a preset risk algorithm, and generate, from the emoticon features, a risk identification rule containing a risk emoticon;
S202: obtain a text to be identified and match the risk identification rule against it; if the match succeeds, determine the text to be identified to be a text containing risk.
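The matching in step S202 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes emoticons appear in text as bracketed tokens such as "[money]", that a rule is a tuple of elements that must all co-occur, and that plain substring search is an acceptable matcher. The names `rule_matches` and `is_risky` are hypothetical.

```python
# Assumption: an emoticon is written as a bracketed token such as "[money]",
# and a rule is a tuple of elements (emoticons or keywords) that must co-occur.

def rule_matches(rule, text):
    """Return True if every element of the rule occurs somewhere in the text."""
    return all(element in text for element in rule)


def is_risky(text, rules):
    """S202 sketch: flag the text as soon as any rule matches it."""
    return any(rule_matches(rule, text) for rule in rules)


# Hypothetical rules, modeled on the examples in the description.
rules = [("[money]", "[phone]"), ("borrowing", "[money]")]

print(is_risky("need cash? [money] call now [phone]", rules))
print(is_risky("happy birthday [fresh flower]", rules))
```

Substring search keeps the sketch short; a production matcher would more likely tokenize the text first so that keywords are not matched inside unrelated words.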
Specifically, the method of determining at least one emoticon feature and generating from it a risk identification rule containing a risk emoticon may, referring to Fig. 3, include the following steps:
S301: obtain a black text set, which is a sample set containing multiple risk texts;
S302: extract the emoticon presence features in the black text set, and generate a risk identification rule containing a risk emoticon from those presence features.
For example, feature identification is performed over the risk texts in the black text set. If the emoticon pattern "two [money] emoticons plus one [phone] emoticon" is found to occur in the black text set at a frequency clearly above the average level, that emoticon combination is determined to be an emoticon presence feature of the black text set. Alternatively, when the share of a text taken up by the emoticon [money] exceeds some threshold, the emoticon [money] with an in-text share above that threshold is determined to be an emoticon presence feature of the black text set.
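The first example above (a specific emoticon pattern that is unusually common in black texts) could be computed roughly as follows. This is a sketch under stated assumptions: emoticons are bracketed tokens, a "pattern" is the multiset of emoticons in one text, and "clearly above the average level" is approximated by a fixed `min_share` cutoff. The function names and the sample texts are hypothetical.

```python
import re
from collections import Counter

# Assumption: emoticons are bracketed tokens such as "[money]".
EMOTICON_RE = re.compile(r"\[[^\[\]]+\]")


def emoticon_pattern(text):
    """Multiset of emoticons in one text, e.g. (("[money]", 2), ("[phone]", 1))."""
    counts = Counter(EMOTICON_RE.findall(text))
    return tuple(sorted(counts.items()))


def frequent_patterns(black_texts, min_share):
    """Patterns found in at least min_share of the black texts."""
    pattern_counts = Counter(
        pattern for pattern in map(emoticon_pattern, black_texts) if pattern
    )
    total = len(black_texts)
    return {
        pattern: count / total
        for pattern, count in pattern_counts.items()
        if count / total >= min_share
    }


# Hypothetical black text set.
black_texts = [
    "[money][money] fast loan [phone]",
    "cash today [money] [phone] [money]",
    "cheap followers [fresh flower]",
    "[money] [money] call [phone]",
]
print(frequent_patterns(black_texts, min_share=0.5))
```

With these four sample texts, only the pattern "two [money] plus one [phone]" reaches the cutoff, since it appears in three of the four texts.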
Emoticon feature extraction is not limited to the above examples; further feature extraction rules can be defined according to actual needs.
In addition to extracting emoticon presence features from the black text set and generating risk identification rules from them, this application also provides a method of generating risk identification rules containing risk emoticons from both a black text set and a white text set. Referring to Fig. 4, the method includes the following steps:
S401: obtain a black text set and a white text set, and compute, for each emoticon, the difference between its frequency of occurrence in the black text set and in the white text set, where the black text set is a sample set containing multiple risk texts and the white text set is a sample set containing multiple risk-free texts;
The black text set and the white text set are collections of black and white text samples prepared in advance, where a black text is a text already confirmed to be a risk text containing junk content, and a white text is a text already confirmed to be safe and free of junk content. Note that, to keep the statistics as accurate as possible, the black and white text sets are generally made similar in number of texts and in size.
The emoticons contained in the black text set and in the white text set are extracted separately, and the frequency with which each emoticon occurs in black texts and in white texts is counted, as shown in Table 1.
Emoticon          Frequency in black text set   Frequency in white text set
[cover face]      0.05375                       0.0376
[fresh flower]    0.04678                       0.0375
[happiness]       0.04446                       0.03392
[phone]           0.02462                       0.02442
……                ……                            ……
Table 1
Once the frequency of each emoticon in black texts and in white texts is known, the difference between the two frequencies is computed for each emoticon, as shown in Table 2:
Emoticon          Frequency difference (black minus white)
[cover face]      0.01615
[fresh flower]    0.00928
[happiness]       0.01054
[phone]           0.00020
……                ……
Table 2
It can be understood that when an emoticon occurs more frequently in black texts than in white texts, and its black-white frequency difference is clearly larger than that of other emoticons, the emoticon is very likely being used by gray- and black-market groups to insert into or replace text in junk content.
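Step S401 can be sketched as below. The description does not pin down how "frequency" is measured, so this sketch assumes it is the fraction of texts in a set that contain the emoticon at least once; the bracketed-token format, the function names, and the sample texts are likewise assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: emoticons are bracketed tokens such as "[money]".
EMOTICON_RE = re.compile(r"\[[^\[\]]+\]")


def emoticon_frequencies(texts):
    """Fraction of texts in which each emoticon appears at least once."""
    doc_counts = Counter()
    for text in texts:
        # set() so that repeats inside one text count only once
        doc_counts.update(set(EMOTICON_RE.findall(text)))
    return {emoticon: n / len(texts) for emoticon, n in doc_counts.items()}


def frequency_differences(black_texts, white_texts):
    """Black-set frequency minus white-set frequency for every emoticon seen."""
    black = emoticon_frequencies(black_texts)
    white = emoticon_frequencies(white_texts)
    return {
        emoticon: black.get(emoticon, 0.0) - white.get(emoticon, 0.0)
        for emoticon in set(black) | set(white)
    }


# Hypothetical sample sets of equal size.
black_texts = ["[money] loan [phone]", "[money][money] fast cash",
               "call [phone] [money]", "cheap goods"]
white_texts = ["call me [phone]", "thanks [fresh flower]",
               "lunch?", "my new [phone]"]

diffs = frequency_differences(black_texts, white_texts)
print(sorted(diffs.items(), key=lambda item: item[1], reverse=True))
```

Here [money] has a large positive difference, [phone] is equally common in both sets, and [fresh flower] is more common in white texts, so only [money] would stand out as a risk candidate.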
S402: take the frequency differences of the different emoticons as a frequency difference set, determine the larger frequency differences in the set to be qualifying frequency differences, and determine the emoticons corresponding to the qualifying frequency differences to be risk emoticons;
Table 2 above lists each emoticon with its frequency difference, which together form the frequency difference set. The set is screened, and the larger frequency differences are determined to be qualifying frequency differences.
There are many ways to determine the qualifying frequency differences, for example:
a) sort the frequency differences in the set from high to low by value, and take the top N as the qualifying frequency differences;
b) sort the frequency differences in the set from high to low by value, and take the top N% as the qualifying frequency differences;
c) take the frequency differences in the set that exceed a preset value as the qualifying frequency differences;
d) sort the frequency differences in the set from high to low by value, select the top N%, and of those take the ones that exceed a preset value as the qualifying frequency differences.
Note that the above ways of determining qualifying frequency differences are only examples and do not limit this specification; developers may screen the larger frequency differences out of the set in other ways according to the actual situation.
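The four screening options a) to d) can be sketched as small functions over a mapping from emoticon to frequency difference. The function names, the choice to round N% up to a whole number of items, and the numeric values are assumptions for illustration; the values are not taken from Table 2.

```python
import math


def top_n(diffs, n):
    """a) Emoticons with the n largest frequency differences."""
    ranked = sorted(diffs.items(), key=lambda item: item[1], reverse=True)
    return [emoticon for emoticon, _ in ranked[:n]]


def top_percent(diffs, percent):
    """b) Emoticons in the top `percent` % (rounded up to a whole count)."""
    return top_n(diffs, math.ceil(len(diffs) * percent / 100))


def above_threshold(diffs, threshold):
    """c) Emoticons whose difference exceeds a preset value."""
    return [e for e in top_n(diffs, len(diffs)) if diffs[e] > threshold]


def top_percent_above_threshold(diffs, percent, threshold):
    """d) Top `percent` % first, then keep only those above the preset value."""
    return [e for e in top_percent(diffs, percent) if diffs[e] > threshold]


# Illustrative frequency differences (hypothetical values).
diffs = {"[money]": 0.030, "[phone]": 0.012,
         "[fresh flower]": 0.004, "[happiness]": -0.002}

print(top_n(diffs, 2))
print(top_percent(diffs, 50))
print(above_threshold(diffs, 0.01))
print(top_percent_above_threshold(diffs, 75, 0.01))
```

Option d) is the strictest of the four: the rank cut bounds how many emoticons can qualify, and the value cut removes any that made the rank cut only because the set is small.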
S403: permute and combine the determined risk emoticons, and generate risk identification rules containing risk emoticons from the resulting combinations.
The different risk emoticons are permuted and combined, and the resulting combinations serve as risk identification rules. For example, suppose the determined risk emoticons are [fresh flower], [happiness], and [phone]. In a text published by a user, the simultaneous occurrence of [fresh flower] and [happiness] is treated as one risk combination, the simultaneous occurrence of [happiness] and [phone] as another, and so on; permuting and combining the different risk emoticons yields a list of possible risk combinations.
In practice, risk emoticons can also be combined with existing risk keywords. For example, the simultaneous occurrence of the risk keyword "borrowing" and the risk emoticon [money] in a text is treated as one risk combination, and the simultaneous occurrence of the risk keyword "regular" and the risk emoticon [phone] as another. Risk keywords can be identified with existing risk keyword recognition techniques, which are not described again here.
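Generating the candidate combinations can be sketched with the standard `itertools` module. Because a rule only requires its elements to appear together in a text, order does not matter, so this sketch uses unordered combinations rather than ordered permutations; that choice, the restriction to pairs, and the name `candidate_rules` are assumptions for illustration.

```python
from itertools import combinations, product


def candidate_rules(risk_emoticons, risk_keywords=(), size=2):
    """Build candidate co-occurrence rules as frozensets of elements.

    Emoticon-only rules are all unordered groups of `size` risk emoticons;
    mixed rules pair each risk keyword with each risk emoticon.
    """
    emoticons = sorted(risk_emoticons)
    rules = [frozenset(group) for group in combinations(emoticons, size)]
    rules += [frozenset(pair) for pair in product(sorted(risk_keywords), emoticons)]
    return rules


rules = candidate_rules(
    ["[fresh flower]", "[happiness]", "[phone]"],
    risk_keywords=["borrowing"],
)
for rule in rules:
    print(sorted(rule))
```

Three risk emoticons give three emoticon pairs, and one keyword adds three keyword-emoticon pairs, for six candidates in total. The number of candidates grows quickly with more emoticons or larger `size`, which is one reason a later screening step is useful.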
After multiple risk combinations have been determined, they are treated as candidate risk identification rules. The candidate rules are then verified and screened to determine the final risk identification rules.
Typically, the verification takes the form of hit verification of the candidate rules: on a verification data set containing a large number of black and white texts, each candidate rule is applied in turn, the numbers of black and white texts it hits are determined, and the hit accuracy of the candidate rule is computed from them. Table 3 shows the statistics obtained after hit verification of the candidate risk identification rules on the verification data set.
Candidate rule                   Texts hit   Total texts   Black texts hit   Hit ratio
[cover face] ^ [fresh flower]    198         9999          190               96%
[phone] ^ [happiness]            231         9999          150               65%
"borrowing" ^ [money]            330         9999          296               90%
……                               ……
Table 3
As shown in the table above, the hit ratio is the ratio of the number of black texts hit by a candidate rule to the total number of texts it hits. The larger this ratio, the more accurate the candidate rule and the more likely it is to identify black texts in practice. Developers can set a threshold according to the actual situation and adopt the candidate rules whose hit ratio exceeds the threshold as the risk identification rules finally put into use.
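The hit verification described above can be sketched as follows. The verification set is assumed to be a list of (text, is_black) pairs, rules are assumed to be tuples of substrings that must all occur, and the names `hit_stats` and `select_rules` are hypothetical.

```python
def hit_stats(rule, labeled_texts):
    """Return (texts hit, black texts hit, hit ratio) for one candidate rule."""
    labels_of_hits = [
        is_black
        for text, is_black in labeled_texts
        if all(element in text for element in rule)
    ]
    hits = len(labels_of_hits)
    black_hits = sum(labels_of_hits)  # True counts as 1
    ratio = black_hits / hits if hits else 0.0
    return hits, black_hits, ratio


def select_rules(candidates, labeled_texts, min_ratio):
    """Keep the candidate rules whose hit ratio exceeds the threshold."""
    return [
        rule for rule in candidates
        if hit_stats(rule, labeled_texts)[2] > min_ratio
    ]


# Hypothetical verification set: (text, is_black).
labeled_texts = [
    ("[money] loan [phone]", True),
    ("[money] call [phone]", True),
    ("lunch [money] [phone]", False),
    ("hi [happiness] [phone]", False),
    ("promo [happiness] [phone]", True),
]
money_phone = ("[money]", "[phone]")
happy_phone = ("[happiness]", "[phone]")

print(hit_stats(money_phone, labeled_texts))
print(hit_stats(happy_phone, labeled_texts))
print(select_rules([money_phone, happy_phone], labeled_texts, min_ratio=0.6))
```

The same arithmetic applies to Table 3: the first row has 190 black texts among 198 hits, and 190/198 is about 96%.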
The embodiments of this specification also provide a more specific risk rule generation method. Referring to Fig. 5, the method may include the following steps:
S501: count the frequency with which each emoticon occurs in the black text set;
S502: count the frequency with which each emoticon occurs in the white text set;
S503: compute, for each emoticon, the difference between its frequency of occurrence in black texts and its frequency of occurrence in white texts;
S504: sort the frequency differences from high to low by value, select the frequency differences whose rank satisfies a first preset value or whose value satisfies a second preset value, and determine the emoticons corresponding to the frequency differences that satisfy the preset condition to be risk emoticons;
S505: permute and combine the different risk emoticons and/or different risk keywords, and generate risk identification rules containing risk emoticons from the resulting combinations.
Corresponding to the above method embodiments, the embodiments of this specification also provide a risk identification rule generation device. Referring to Fig. 6, the device may include a rule generation module 610 and a text identification module 620.
Rule generation module 610: configured to compute emoticon features in text according to a preset risk algorithm and to generate, from the emoticon features, a risk identification rule containing a risk emoticon;
Text identification module 620: configured to obtain a text to be identified and match the risk identification rule against it, and, if the match succeeds, to determine the text to be identified to be a text containing risk.
The embodiments of this specification also provide a computer device that includes at least a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the aforementioned risk text recognition method when executing the program, the method including at least:
computing emoticon features in text according to a preset risk algorithm, and generating, from the emoticon features, a risk identification rule containing a risk emoticon;
obtaining a text to be identified and matching the risk identification rule against it, and, if the match succeeds, determining the text to be identified to be a text containing risk.
Fig. 7 shows a more specific schematic diagram of the hardware structure of a computing device provided by the embodiments of this specification. The device may include a processor 1010, a memory 1020, an input/output interface 1030, a communication interface 1040, and a bus 1050. The processor 1010, memory 1020, input/output interface 1030, and communication interface 1040 are communicatively connected to one another inside the device through the bus 1050.
The processor 1010 may be implemented as a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and executes the relevant programs to implement the technical solutions provided by the embodiments of this specification.
The memory 1020 may be implemented as ROM (Read Only Memory), RAM (Random Access Memory), a static storage device, a dynamic storage device, or the like. The memory 1020 can store an operating system and other application programs. When the technical solutions provided by the embodiments of this specification are implemented in software or firmware, the relevant program code is stored in the memory 1020 and called and executed by the processor 1010.
The input/output interface 1030 connects to an input/output module to enable information input and output. The input/output module may be configured as a component inside the device (not shown) or attached externally to provide the corresponding functions. Input devices may include a keyboard, mouse, touch screen, microphone, and various sensors; output devices may include a display, loudspeaker, vibrator, and indicator lights.
The communication interface 1040 connects to a communication module (not shown) to enable communication between this device and other devices. The communication module may communicate over a wired connection (such as USB or a network cable) or wirelessly (such as a mobile network, WIFI, or Bluetooth).
The bus 1050 includes a path that carries information between the components of the device (such as the processor 1010, memory 1020, input/output interface 1030, and communication interface 1040).
It should be noted that although only the processor 1010, memory 1020, input/output interface 1030, communication interface 1040, and bus 1050 are shown for the above device, in a specific implementation the device may also include other components necessary for normal operation. In addition, those skilled in the art will understand that the device may also include only the components necessary to implement the solutions of the embodiments of this specification, rather than all the components shown in the figure.
The embodiments of this specification also provide a computer-readable storage medium on which a computer program is stored. When executed by a processor, the program implements the aforementioned risk text recognition method, the method including at least:
computing emoticon features in text according to a preset risk algorithm, and generating, from the emoticon features, a risk identification rule containing a risk emoticon;
obtaining a text to be identified and matching the risk identification rule against it, and, if the match succeeds, determining the text to be identified to be a text containing risk.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible to a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
For device embodiment, since it corresponds essentially to embodiment of the method, so related place is referring to method reality Apply the part explanation of example.The apparatus embodiments described above are merely exemplary, wherein described be used as separation unit The unit of explanation may or may not be physically separated, and component shown as a unit can be or can also be with It is not physical unit, it can it is in one place, or may be distributed over multiple network units.It can be according to actual The purpose for needing to select some or all of the modules therein to realize this specification scheme.Those of ordinary skill in the art are not In the case where making the creative labor, it can understand and implement.
As seen through the above description of the embodiments, those skilled in the art can be understood that this specification Embodiment can be realized by means of software and necessary general hardware platform.Based on this understanding, this specification is implemented Substantially the part that contributes to existing technology can be embodied in the form of software products the technical solution of example in other words, The computer software product can store in storage medium, such as ROM/RAM, magnetic disk, CD, including some instructions are to make It is each to obtain computer equipment (can be personal computer, server or the network equipment etc.) execution this specification embodiment Method described in certain parts of a embodiment or embodiment.
System, device, module or the unit that above-described embodiment illustrates can specifically realize by computer chip or entity, Or it is realized by the product with certain function.A kind of typically to realize that equipment is computer, the concrete form of computer can To be personal computer, laptop computer, cellular phone, camera phone, smart phone, personal digital assistant, media play In device, navigation equipment, E-mail receiver/send equipment, game console, tablet computer, wearable device or these equipment The combination of any several equipment.
The embodiments in this specification are described in a progressive manner; for the same or similar parts, the embodiments may refer to one another, and each embodiment focuses on its differences from the others. In particular, because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly, and reference may be made to the description of the method embodiments for relevant details. The device embodiments described above are merely illustrative: modules described as separate components may or may not be physically separate, and when the solution of this specification is implemented, the functions of the modules may be realized in one or more pieces of software and/or hardware. Some or all of the modules may also be selected according to actual needs to achieve the purpose of the solution of this embodiment. A person of ordinary skill in the art can understand and implement this without creative effort.
The above are only specific implementations of the embodiments of this specification. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principles of the embodiments of this specification, and these improvements and refinements shall also be regarded as falling within the protection scope of the embodiments of this specification.

Claims (15)

1. A risk text recognition method, the method comprising:
computing emoticon features in text according to a preset risk algorithm, and generating, from the emoticon features, a risk identification rule comprising risk emoticons;
obtaining a text to be identified and matching the risk identification rule against the text to be identified; if the match succeeds, determining that the text to be identified is a text containing risk.
2. The method of claim 1, wherein computing the emoticon features in text according to the preset risk algorithm and determining the emoticon features as the risk identification rule comprising risk emoticons comprises:
obtaining a black text set, the black text set being a sample set comprising multiple risk texts;
extracting emoticon presence features from the black text set, and generating, from the emoticon presence features, the risk identification rule comprising risk emoticons.
3. The method of claim 1, wherein computing the emoticon features in text according to the preset risk algorithm and determining the emoticon features as the risk identification rule comprising risk emoticons comprises:
obtaining a black text set and a white text set, and computing, for each of the different emoticons, the difference between the frequencies with which it appears in the black text set and in the white text set, wherein the black text set is a sample set comprising multiple risk texts and the white text set is a sample set comprising multiple risk-free texts;
taking the frequency differences corresponding to the different emoticons as a frequency difference set, determining the frequency differences with larger values in the frequency difference set to be qualifying frequency differences, and determining the emoticons corresponding to the qualifying frequency differences to be risk emoticons;
permuting and combining the determined risk emoticons, and generating, from the permutation and combination result, the risk identification rule comprising risk emoticons.
4. The method of claim 3, wherein determining the frequency differences with larger values in the frequency difference set to be qualifying frequency differences comprises:
sorting the frequency differences in the frequency difference set from high to low by value, and determining the frequency differences whose rank is higher than a first preset value, or whose value is greater than a second preset value, to be qualifying frequency differences.
5. The method of claim 3, wherein permuting and combining the different risk emoticons and generating, from the permutation and combination result, the risk identification rule comprising risk emoticons comprises:
permuting and combining different risk emoticons and/or different risk keywords, and generating, from the permutation and combination result, the risk identification rule comprising risk emoticons, wherein the risk keywords are identified according to a preset keyword risk identification rule.
6. The method of claim 3, wherein generating, from the permutation and combination result, the risk identification rule comprising risk emoticons comprises:
generating, from the permutation and combination result, candidate risk identification rules comprising risk emoticons;
performing hit verification on the candidate risk identification rules using a verification data set comprising black and white texts, and determining the candidate risk identification rules that pass the verification to be final risk identification rules.
7. The method of claim 6, wherein performing hit verification on the candidate risk identification rules using the verification data set comprising black and white texts, and determining the candidate risk identification rules that pass the verification to be final risk identification rules, comprises:
performing hit verification on the candidate risk identification rules using the verification data set comprising black and white texts, counting the numbers of black and white texts hit by each risk identification rule, and computing its hit accuracy;
determining that the candidate risk identification rules whose hit accuracy is greater than a preset threshold pass the hit verification, and determining them to be final risk identification rules.
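As an illustration only, the rule-generation steps recited in claims 3 to 5 can be sketched in Python roughly as follows. The claims do not fix how frequency is measured, what the preset values are, or how large a combination may be, so everything named here is an assumption made for the sketch: the function names, the bracketed emoticon tokens such as `[money]`, the use of "share of texts containing the emoticon" as the frequency, the `top_k` and `min_diff` parameters standing in for the first and second preset values, and the use of unordered combinations in place of full permutation and combination.

```python
from collections import Counter
from itertools import combinations


def emoticon_frequencies(texts, emoticons):
    """Share of texts in which each emoticon occurs at least once (assumed metric)."""
    counts = Counter()
    for text in texts:
        for emo in emoticons:
            if emo in text:
                counts[emo] += 1
    total = len(texts) or 1  # avoid division by zero on an empty sample set
    return {emo: counts[emo] / total for emo in emoticons}


def select_risk_emoticons(black_texts, white_texts, emoticons,
                          top_k=None, min_diff=None):
    """Pick emoticons whose black-minus-white frequency difference qualifies.

    top_k and min_diff are hypothetical stand-ins for the first and second
    preset values; an emoticon qualifies if either condition holds.
    """
    black = emoticon_frequencies(black_texts, emoticons)
    white = emoticon_frequencies(white_texts, emoticons)
    diffs = {emo: black[emo] - white[emo] for emo in emoticons}
    # Sort from high to low by frequency difference.
    ranked = sorted(emoticons, key=lambda e: diffs[e], reverse=True)
    selected = [
        emo for rank, emo in enumerate(ranked)
        if (top_k is not None and rank < top_k)
        or (min_diff is not None and diffs[emo] > min_diff)
    ]
    return selected, diffs


def candidate_rules(risk_emoticons, risk_keywords=(), max_size=2):
    """Combine risk emoticons (and optional risk keywords) into candidate rules.

    Each rule is a frozenset of tokens and must contain at least one emoticon.
    """
    risk_set = set(risk_emoticons)
    tokens = list(risk_emoticons) + list(risk_keywords)
    rules = []
    for size in range(1, max_size + 1):
        for combo in combinations(tokens, size):
            if risk_set.intersection(combo):
                rules.append(frozenset(combo))
    return rules
```

Representing a rule as an unordered set keeps the sketch short; an implementation that cares about emoticon order could swap `combinations` for `itertools.permutations` and store tuples instead.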
8. A risk text recognition device, the device comprising:
a rule generation module, configured to compute emoticon features in text according to a preset risk algorithm and to generate, from the emoticon features, a risk identification rule comprising risk emoticons;
a text identification module, configured to obtain a text to be identified and match the risk identification rule against the text to be identified, and, if the match succeeds, to determine that the text to be identified is a text containing risk.
9. The device of claim 8, wherein the rule generation module is specifically configured to:
obtain a black text set, the black text set being a sample set comprising multiple risk texts;
extract emoticon presence features from the black text set, and generate, from the emoticon presence features, the risk identification rule comprising risk emoticons.
10. The device of claim 8, wherein the rule generation module is specifically configured to:
obtain a black text set and a white text set, and compute, for each of the different emoticons, the difference between the frequencies with which it appears in the black text set and in the white text set, wherein the black text set is a sample set comprising multiple risk texts and the white text set is a sample set comprising multiple risk-free texts;
take the frequency differences corresponding to the different emoticons as a frequency difference set, determine the frequency differences with larger values in the frequency difference set to be qualifying frequency differences, and determine the emoticons corresponding to the qualifying frequency differences to be risk emoticons;
permute and combine the determined risk emoticons, and generate, from the permutation and combination result, the risk identification rule comprising risk emoticons.
11. The device of claim 10, wherein determining the frequency differences with larger values in the frequency difference set to be qualifying frequency differences comprises:
sorting the frequency differences in the frequency difference set from high to low by value, and determining the frequency differences whose rank is higher than a first preset value, or whose value is greater than a second preset value, to be qualifying frequency differences.
12. The device of claim 10, wherein permuting and combining the different risk emoticons and generating, from the permutation and combination result, the risk identification rule comprising risk emoticons comprises:
permuting and combining different risk emoticons and/or different risk keywords, and generating, from the permutation and combination result, the risk identification rule comprising risk emoticons, wherein the risk keywords are identified according to a preset keyword risk identification rule.
13. The device of claim 10, wherein generating, from the permutation and combination result, the risk identification rule comprising risk emoticons comprises:
generating, from the permutation and combination result, candidate risk identification rules comprising risk emoticons;
performing hit verification on the candidate risk identification rules using a verification data set comprising black and white texts, and determining the candidate risk identification rules that pass the verification to be final risk identification rules.
14. The device of claim 11, wherein performing hit verification on the candidate risk identification rules using the verification data set comprising black and white texts, and determining the candidate risk identification rules that pass the verification to be final risk identification rules, comprises:
performing hit verification on the candidate risk identification rules using the verification data set comprising black and white texts, counting the numbers of black and white texts hit by each risk identification rule, and computing its hit accuracy;
determining that the candidate risk identification rules whose hit accuracy is greater than a preset threshold pass the hit verification, and determining them to be final risk identification rules.
15. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor implements the method of claim 1 when executing the program.
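For illustration, the hit-verification step of claims 6 and 7 (mirrored in device claims 13 and 14) and the matching step of claim 1 might be sketched in Python as below. The claims do not define "hit accuracy" or what it means for a rule to match, so this sketch assumes that a rule is a set of tokens that hits a text when every token occurs in it, and that hit accuracy is the number of black texts hit divided by all texts hit. The function names, the bracketed emoticon tokens, and the default threshold are likewise hypothetical.

```python
def rule_hits(rule, text):
    """Assumed matching semantics: a rule hits when all of its tokens occur in the text."""
    return all(token in text for token in rule)


def hit_accuracy(rule, black_texts, white_texts):
    """Black hits divided by total hits (an assumed definition of hit accuracy)."""
    black_hits = sum(rule_hits(rule, t) for t in black_texts)
    white_hits = sum(rule_hits(rule, t) for t in white_texts)
    total = black_hits + white_hits
    return black_hits / total if total else 0.0


def verify_rules(candidates, black_texts, white_texts, threshold=0.9):
    """Keep the candidate rules whose hit accuracy exceeds the preset threshold."""
    return [
        rule for rule in candidates
        if hit_accuracy(rule, black_texts, white_texts) > threshold
    ]


def is_risky(text, final_rules):
    """A text is treated as containing risk if any final rule matches it."""
    return any(rule_hits(rule, text) for rule in final_rules)


# Hypothetical candidate rules and a small validation set of black/white texts.
candidates = [
    frozenset({"[money]"}),
    frozenset({"[smile]"}),
    frozenset({"[money]", "[fire]"}),
]
black_validation = ["[money] add me", "[money][fire] win", "[smile] hi"]
white_validation = ["[smile] morning", "[money] back please", "ok"]

final_rules = verify_rules(candidates, black_validation, white_validation,
                           threshold=0.9)
print(is_risky("[fire] big win [money]", final_rules))
```

Measuring accuracy over hits only rewards rules that rarely fire on white texts; a rule that hits nothing scores 0.0 here and is dropped, which is a design choice of this sketch rather than something the claims prescribe.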
CN201810713229.8A 2018-06-29 2018-06-29 Risk text recognition method and device Active CN109033224B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810713229.8A CN109033224B (en) 2018-06-29 2018-06-29 Risk text recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810713229.8A CN109033224B (en) 2018-06-29 2018-06-29 Risk text recognition method and device

Publications (2)

Publication Number Publication Date
CN109033224A true CN109033224A (en) 2018-12-18
CN109033224B CN109033224B (en) 2022-02-01

Family

ID=65522208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810713229.8A Active CN109033224B (en) 2018-06-29 2018-06-29 Risk text recognition method and device

Country Status (1)

Country Link
CN (1) CN109033224B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287493A (en) * 2019-06-28 2019-09-27 中国科学技术信息研究所 Risk phrase chunking method, apparatus, electronic equipment and storage medium
CN113742557A (en) * 2021-08-10 2021-12-03 北京深演智能科技股份有限公司 Method and device for recommending application program identification rules
CN113762973A (en) * 2021-05-24 2021-12-07 腾讯科技(深圳)有限公司 Data processing method and device, computer readable medium and electronic equipment
CN115065509A (en) * 2022-05-27 2022-09-16 中电长城网际系统应用有限公司 Method and device for identifying risk of statistical inference attack based on deviation function

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102184188A (en) * 2011-04-15 2011-09-14 百度在线网络技术(北京)有限公司 Method and equipment for determining sensitivity of target text
JP2015099289A (en) * 2013-11-20 2015-05-28 日本電信電話株式会社 Utterance key word extraction device, key word extraction system using the device, method and program thereof
US20150227497A1 (en) * 2012-09-17 2015-08-13 Tencent Technology (Shenzhen) Company Limited Method and apparatus for identifying garbage template article
CN104866465A (en) * 2014-02-25 2015-08-26 腾讯科技(深圳)有限公司 Sensitive text detection method and device
CN105589845A (en) * 2015-12-18 2016-05-18 北京奇虎科技有限公司 Junk text recognizing method, device and system
CN107171937A (en) * 2017-05-11 2017-09-15 翼果(深圳)科技有限公司 The method and system of anti-rubbish mail
CN107229638A (en) * 2016-03-24 2017-10-03 北京搜狗科技发展有限公司 A kind of text message processing method and device
CN107566391A (en) * 2017-09-20 2018-01-09 上海斗象信息科技有限公司 Domain identification plus the method for the topic identification structure machine learning model detection dark chain of webpage

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110287493A (en) * 2019-06-28 2019-09-27 中国科学技术信息研究所 Risk phrase chunking method, apparatus, electronic equipment and storage medium
CN110287493B (en) * 2019-06-28 2023-04-18 中国科学技术信息研究所 Risk phrase identification method and device, electronic equipment and storage medium
CN113762973A (en) * 2021-05-24 2021-12-07 腾讯科技(深圳)有限公司 Data processing method and device, computer readable medium and electronic equipment
CN113742557A (en) * 2021-08-10 2021-12-03 北京深演智能科技股份有限公司 Method and device for recommending application program identification rules
CN115065509A (en) * 2022-05-27 2022-09-16 中电长城网际系统应用有限公司 Method and device for identifying risk of statistical inference attack based on deviation function
CN115065509B (en) * 2022-05-27 2024-04-02 中电长城网际系统应用有限公司 Risk identification method and device for statistical inference attack based on deviation function

Also Published As

Publication number Publication date
CN109033224B (en) 2022-02-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40001885

Country of ref document: HK

TA01 Transfer of patent application right

Effective date of registration: 20200924

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20200924

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant