CN106815242A - Textual resources data detection method and device - Google Patents

Textual resources data detection method and device

Info

Publication number
CN106815242A
Authority
CN
China
Prior art keywords
textual resources
data
resources data
contact method
matched
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510859762.1A
Other languages
Chinese (zh)
Inventor
陈尔晓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201510859762.1A priority Critical patent/CN106815242A/en
Publication of CN106815242A publication Critical patent/CN106815242A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention relates to a text resource data detection method and device. The method comprises the following steps: obtaining text resource data; matching the text resource data against a keyword lexicon and against a contact-information feature library, respectively; and, when the text resource data matches both the keyword lexicon and the contact-information feature library, determining that the text resource data is target text resource data. Because the obtained text resource data is treated as target text resource data only when it matches both the keyword lexicon and the contact-information feature library, target text resource data can be identified accurately. Compared with keyword-lexicon matching alone, accuracy is higher and the misjudgment rate is lower.

Description

Text resource data detection method and device
Technical field
The present invention relates to the field of data detection, and in particular to a text resource data detection method and device.
Background
The development of Internet technology has brought great convenience to people's lives, but the rapid growth in the number of web pages and in UGC (User Generated Content) has also brought the problem of information overload. Effectively screening the required resource data out of massive amounts of data, or filtering out unwanted resource data so that users can browse conveniently, has therefore become extremely important.
Take the topic-circle commenting system of a social network as an example. Resource data publishers from many places typically publish resource data on such a platform, so the platform needs to identify this resource data quickly and decide whether to filter it out in order to maintain a healthy browsing environment. Traditional resource data detection generally searches for keywords: resource data that matches a keyword lexicon is regarded as the resource data to be detected. In some cases, however, the text in resource data is strongly context-dependent, and some words express completely different meanings under particular conditions, so detection by keyword matching alone has a high misjudgment rate.
Summary of the invention
In view of the high misjudgment rate of keyword matching in traditional resource data detection methods, it is necessary to provide a text resource data detection method that can reduce the misjudgment rate.
It is also necessary to provide a text resource data detection device that can reduce the misjudgment rate.
A text resource data detection method comprises the following steps:
obtaining text resource data;
matching the text resource data against a keyword lexicon, and matching the text resource data against a contact-information feature library; and
when the text resource data matches the keyword lexicon and also matches the contact-information feature library, determining that the text resource data is target text resource data.
A text resource data detection device comprises:
an acquisition module, configured to obtain text resource data;
a matching module, configured to match the text resource data against a keyword lexicon and to match the text resource data against a contact-information feature library; and
a screening module, configured to determine that the text resource data is target text resource data when the text resource data matches the keyword lexicon and also matches the contact-information feature library.
In the above text resource data detection method and device, the obtained text resource data is matched against the keyword lexicon and against the contact-information feature library, and is treated as target text resource data only when it matches both. Combining keyword-lexicon matching with contact-information matching in this way allows target text resource data to be identified accurately; compared with keyword-lexicon matching alone, accuracy is higher and the misjudgment rate is lower.
Brief description of the drawings
Figure 1A is a schematic diagram of the internal structure of a terminal in one embodiment;
Figure 1B is a schematic diagram of the internal structure of a server in one embodiment;
Fig. 2 is a flowchart of a text resource data detection method in one embodiment;
Fig. 3 is a flowchart of a text resource data detection method in another embodiment;
Fig. 4 is a schematic diagram of information published on a text resource data publishing platform in one embodiment;
Fig. 5 is a schematic diagram of the advertisement text resource data detected from the information in Fig. 4;
Fig. 6 is a structural block diagram of a text resource data detection device in one embodiment;
Fig. 7 is a structural block diagram of a text resource data detection device in another embodiment.
Detailed description
To make the objectives, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and are not intended to limit it.
Figure 1A is a schematic diagram of the internal structure of a terminal in one embodiment. As shown in Figure 1A, the terminal includes a processor, a non-volatile storage medium, memory, a network interface, a display screen, and an input device connected through a system bus. The non-volatile storage medium of the terminal stores an operating system and a text resource data detection device, which is used to implement a text resource data detection method. The terminal may also store the keyword lexicon, the contact-information feature library, and the like. The processor provides computing and control capability, supports the operation of the whole terminal, and is configured to execute the text resource data detection method. The memory of the terminal provides a running environment for the text resource data detection device stored in the non-volatile storage medium. The network interface is used for network communication with a server, for example sending data access requests to the server and receiving the data the server returns. The display screen of the terminal may be a liquid crystal display, an electronic ink display, or the like. The input device may be a touch layer covering the display screen; a button, trackball, or trackpad provided on the terminal housing; or an external keyboard, trackpad, or mouse. The terminal may be a mobile phone, a tablet computer, a personal digital assistant, or the like. Those skilled in the art will understand that the structure shown in Figure 1A is only a block diagram of the part of the structure relevant to the present solution and does not limit the terminal to which the solution is applied; a specific terminal may include more or fewer components than shown, combine some components, or arrange the components differently.
Figure 1B is a schematic diagram of the internal structure of a server in one embodiment. As shown in Figure 1B, the server includes a processor, a storage medium, memory, a display screen, an input device, and a network interface connected through a system bus. The non-volatile storage medium of the server stores an operating system, a database, and a text resource data detection device. The database stores the keyword lexicon, the contact-information feature library, and the like, and the text resource data detection device is used to implement a text resource data detection method. The processor of the server provides computing and control capability, supports the operation of the whole server, and is configured to execute the text resource data detection method. The memory of the server provides a running environment for the text resource data detection device stored in the storage medium. The display screen of the server may be a liquid crystal display, an electronic ink display, or the like. The input device may be a touch layer covering the display screen; a button, trackball, or trackpad provided on the housing; or an external keyboard, trackpad, or mouse. The network interface of the server is used to communicate with external terminals over a network connection, for example receiving data access requests sent by a terminal and returning data to the terminal. The server can be implemented as an independent server or as a server cluster composed of multiple servers. Those skilled in the art will understand that the structure shown in Figure 1B is only a block diagram of the part of the structure relevant to the present solution and does not limit the server to which the solution is applied; a specific server may include more or fewer components than shown, combine some components, or arrange the components differently.
It should be noted that the text resource data detection method provided by the present invention can be applied on a terminal or on a server.
Fig. 2 is a flowchart of a text resource data detection method in one embodiment. As shown in Fig. 2, a text resource data detection method comprises the following steps:
Step 202: obtain text resource data.
Specifically, the text resource data may be any text resource data obtained from a network, such as news data, comment data, or product data. It may also be text resource data obtained by recognizing and converting any picture resource data obtained from the network, or text resource data converted from any audio or video resource data obtained from the network. Resource data can be captured from massive network data by a resource crawling tool.
Step 204: match the text resource data against the keyword lexicon, and match the text resource data against the contact-information feature library.
Specifically, a keyword lexicon is a pre-collected set of keywords that represent the characteristics of a certain class of resource. A corresponding keyword lexicon can be built for each resource type. The keyword lexicons may include one or more of a keyword lexicon for a certain class of articles, a keyword lexicon for promotion-type advertisements, a keyword lexicon for pornographic advertisements, and the like.
For example, the keyword lexicon for promotion-type advertisements contains keywords such as: Part-time, Amalgamation-becoming duties, money paid for odd jobs, high salary, firewood Provide, make money, earning money, day ties, runs a shop, cover treasured, Tao treasured, New-Light treasured, cry loudly treasured, grape treasured, Tao Bao, hide it is precious, Tao is precious, flood is precious, cover is precious, draw that precious, great waves are precious, it is precious to escape, draw guarantor, peach treasured, nine foldings, eight foldings, seven foldings, six Folding, five foldings, four fold, three foldings, eighty percent discount, a folding, shop, Taobao, pocket money, employ sincerely, business personnel, Alipay, in unlimited time, do not limit that place, time do not limit, place does not limit, taobao, net purchase, the free time, trick Engage, base pay, cash pledge, advertisement, agency, it is part-time, and ear only, earn money, 1 folding, 2 foldings, 3 foldings, 4 Folding, 5 foldings, 6 foldings, 7 foldings, 8 foldings, 9 foldings, take in, say knot, Gong Capital, being in can do, hourly worker, have Computer, wallet, full-time mother, wholesale, etc.
The keyword lexicon for pornographic advertisements contains keywords such as: ripe female, young married woman, it is plentiful, tempt, commit incest.
The text resource data is matched against the keyword lexicon: if the text resource data contains a keyword from the keyword lexicon, the text resource data matches the keyword lexicon; otherwise, it does not.
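The containment test described above can be sketched in Python. This is a minimal illustration rather than the patent's implementation: the function name `matches_lexicon` and the sample lexicon entries are hypothetical, and plain substring containment is assumed as the matching rule.

```python
def matches_lexicon(text, lexicon):
    """Return True if `text` contains at least one entry of `lexicon`.

    Assumption: "matching" means plain substring containment.
    """
    return any(entry in text for entry in lexicon)


# Hypothetical sample entries, for illustration only.
promotion_lexicon = {"part-time", "extra income", "make money"}

print(matches_lexicon("Earn extra income from home", promotion_lexicon))  # True
print(matches_lexicon("Beijing won the bid", promotion_lexicon))          # False
```

Substring containment is the simplest reading of "contains a keyword"; a production system with a large lexicon might prefer a multi-pattern matcher, which the source does not discuss.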
The contact-information feature library may include a contact-information lexicon, or it may include a contact-information lexicon together with a consecutive-character count threshold. The contact-information lexicon is a pre-collected set of keywords that represent contact information. The contact-information lexicon may include: to browse, contact, scratch, button button, Qiu Qiu, wechat, common vetch letter, cocoa, net Stand, have mobile phone, have computer, support computer, support mobile phone, support Android, support android, see above Name, see above title, chatroom, space, button space, q photograph albums, buckle photograph album, word remove, kowtow, Prestige, V believe, add group, +skirt, customer service, V new, Pull Pull, Wei letter, consulting, understanding details, download ground Location, private chat, be letter, etc. The consecutive-character count threshold can be set as required, for example to 5, 6, or 7.
The text resource data is matched against the contact-information feature library: if the text resource data contains data from the contact-information feature library, the text resource data matches the contact-information feature library; otherwise, it does not.
It should be noted that matching the text resource data against the keyword lexicon and matching it against the contact-information feature library have no required order: the two can be performed in parallel, or one can be performed before the other.
Step 206: when the text resource data matches the keyword lexicon and also matches the contact-information feature library, the text resource data is target text resource data.
Specifically, when the text resource data matches both the keyword lexicon and the contact-information feature library, the text resource data is target text resource data. The target text resource data may be product data to be screened out, advertisement text resource data, or the like. If the target text resource data is advertisement text resource data, it can be filtered out.
In the above text resource data detection method, the obtained text resource data is matched against the keyword lexicon and against the contact-information feature library, and is treated as target text resource data only when it matches both. Combining keyword-lexicon matching with contact-information matching in this way allows target text resource data to be identified accurately; compared with keyword-lexicon matching alone, accuracy is higher and the misjudgment rate is lower.
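Steps 202 to 206 can be sketched as a single decision function. This is a minimal sketch under stated assumptions: the names `contains_any` and `is_target_text` and the sample lexicon entries are hypothetical, and both matches are assumed to be substring containment.

```python
def contains_any(text, terms):
    """Return True if any term occurs in the text (substring containment)."""
    return any(term in text for term in terms)


def is_target_text(text, keyword_lexicon, contact_lexicon):
    """Flag text as target data only when BOTH matches succeed."""
    # The two checks are independent, so they may run in either order
    # (or in parallel) without changing the result.
    keyword_hit = contains_any(text, keyword_lexicon)
    contact_hit = contains_any(text, contact_lexicon)
    return keyword_hit and contact_hit


# Hypothetical lexicon entries, for illustration only.
keywords = {"extra income", "part-time"}
contacts = {"wechat", "add q"}

print(is_target_text("Earn extra income, add q510100", keywords, contacts))    # True
print(is_target_text("I found a part-time job last week", keywords, contacts)) # False
```

The second call shows the intended benefit: a text that merely mentions a keyword, with no contact information, is not flagged.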
In one embodiment, the keyword lexicon includes keyword lexicons for different resource types;
the step of matching the text resource data against the keyword lexicon includes:
matching the text resource data against the keyword lexicon of each resource type;
if the text resource data contains a keyword from the keyword lexicon of a certain resource type, the text resource data matches the keyword lexicon of that resource type.
For example, the keyword lexicons include a promotion-type advertisement lexicon and a pornographic advertisement lexicon. The text resource data is matched against the promotion-type advertisement lexicon; if the text resource data contains a keyword from that lexicon, the text resource data matches the promotion-type advertisement lexicon.
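Matching against one lexicon per resource type might look like the following sketch. The function name `matched_resource_types`, the type labels, and the lexicon entries are hypothetical and chosen only for illustration.

```python
def matched_resource_types(text, lexicons_by_type):
    """Return the resource types whose keyword lexicon the text matches.

    `lexicons_by_type` maps a resource-type label to a set of keywords.
    """
    return [
        resource_type
        for resource_type, lexicon in lexicons_by_type.items()
        if any(keyword in text for keyword in lexicon)
    ]


# Hypothetical lexicons keyed by resource type.
lexicons = {
    "promotion_ad": {"part-time", "make money"},
    "pornographic_ad": {"tempt"},
}

print(matched_resource_types("make money from home", lexicons))  # ['promotion_ad']
```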
In one embodiment, the contact-information feature library includes a contact-information lexicon; the step of matching the text resource data against the contact-information feature library then includes: matching the text resource data against the contact-information lexicon.
Specifically, the text resource data is matched against the contact-information lexicon: if the text resource data contains data from the contact-information lexicon, the text resource data matches the contact-information lexicon; otherwise, it does not.
In one embodiment, the contact-information feature library includes a contact-information lexicon and a consecutive-character count threshold; the step of matching the text resource data against the contact-information feature library then includes:
matching the text resource data against the contact-information lexicon, and determining whether the number of consecutive characters in the text resource data is greater than or equal to the consecutive-character count threshold;
when the text resource data matches the contact-information lexicon and the number of consecutive characters in the text resource data is greater than or equal to the consecutive-character count threshold, the text resource data matches the contact-information feature library; otherwise, the text resource data does not match the contact-information feature library.
Specifically, the consecutive-character count threshold can be set as required, for example to 5, 6, or 7.
Because the contact-information feature library includes both a contact-information lexicon and a consecutive-character count threshold, the text resource data must match the contact-information lexicon and also contain a number of consecutive characters greater than or equal to the threshold. Requiring both conditions allows target text resource data to be detected more accurately.
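The lexicon-plus-threshold variant can be sketched as follows. The source does not say which characters count toward the consecutive run; this sketch assumes a run of ASCII digits (as in an account or phone number), and the function names and lexicon entries are hypothetical.

```python
import re


def has_consecutive_run(text, threshold):
    """Return True if the text contains at least `threshold` consecutive digits.

    Assumption: "consecutive characters" means ASCII digits; the source
    leaves the character class unspecified.
    """
    return re.search(r"[0-9]{%d,}" % threshold, text) is not None


def matches_contact_features(text, contact_lexicon, threshold=5):
    """Match only if a contact term AND a long enough digit run are present."""
    lexicon_hit = any(term in text for term in contact_lexicon)
    return lexicon_hit and has_consecutive_run(text, threshold)


contacts = {"add q", "customer service"}  # hypothetical entries

print(matches_contact_features("If interested, add q510100", contacts))     # True
print(matches_contact_features("Please ask customer service", contacts))    # False
```

The second call illustrates why the threshold helps: a contact term with no accompanying account-like string is not treated as contact information.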
Fig. 3 is a flowchart of a text resource data detection method in another embodiment. As shown in Fig. 3, a text resource data detection method comprises the following steps:
Step 302: obtain text resource data.
Specifically, the text resource data may be any text resource data obtained from a network, such as news data, comment data, or product data. It may also be text resource data obtained by recognizing and converting any picture resource data obtained from the network, or text resource data converted from any audio or video resource data obtained from the network. Resource data can be captured from massive network data by a resource crawling tool.
Step 304: preprocess the text resource data to obtain preprocessed text resource data.
Specifically, the text resource data is preprocessed to remove interfering information such as emoticons and punctuation marks, leaving text information such as Chinese characters, digits, and letters; the result is the preprocessed text resource data.
Emoticons can be removed by directly replacing them with the empty string. After the emoticons are removed, pattern matching with a regular expression can be used to retain the Chinese characters, digits, letters, and so on, yielding the preprocessed text resource data.
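The preprocessing in step 304 could be sketched as below. The assumptions are labeled in the code: emoticons are taken to be literal tokens such as `[smile]` supplied by the caller, and "Chinese characters" is approximated by the CJK Unified Ideographs range U+4E00 to U+9FFF. The function name `preprocess` is hypothetical.

```python
import re

# Everything that is NOT a CJK ideograph, ASCII digit, or ASCII letter.
# Assumption: U+4E00..U+9FFF approximates "Chinese characters".
_INTERFERENCE = re.compile(r"[^\u4e00-\u9fff0-9A-Za-z]+")


def preprocess(text, emoticon_tokens=()):
    """Strip emoticon tokens, then drop punctuation and other symbols."""
    # Step 1: replace each emoticon token with the empty string.
    for token in emoticon_tokens:
        text = text.replace(token, "")
    # Step 2: keep only Chinese characters, digits, and letters.
    return _INTERFERENCE.sub("", text)


print(preprocess("加q：510-100！[smile]", emoticon_tokens=["[smile]"]))  # 加q510100
```

Note that this sketch also removes spaces, which suits Chinese text; it also rejoins a number that was split by punctuation to evade detection.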
Step 306: perform word segmentation on the preprocessed text resource data to obtain the segmented words of the text resource data.
Specifically, the Chinese text in the preprocessed text resource data is segmented into words and phrases. For example, segmenting "I am Chinese, and my home is in Shenzhen" yields segments such as "I", "Chinese", "home", and "Shenzhen".
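The source does not name a segmentation algorithm. As one simple possibility, the sketch below uses dictionary-based forward maximum matching, which is a swapped-in technique chosen for illustration; the function name `segment` and the tiny dictionary are hypothetical. Unlike the example above, it keeps every character, including single-character function words.

```python
def segment(text, dictionary):
    """Segment text by forward maximum matching against a word dictionary.

    At each position the longest dictionary word is taken; if none
    matches, a single character is emitted.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical dictionary; input is the preprocessed example sentence.
words = {"中国", "中国人", "深圳"}
print(segment("我是中国人家在深圳", words))
# ['我', '是', '中国人', '家', '在', '深圳']
```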
Step 308: match the segmented words of the text resource data against the keyword lexicon, and match the segmented words of the text resource data against the contact-information feature library.
Specifically, a keyword lexicon is a pre-collected set of keywords that represent the characteristics of a certain class of resource. A corresponding keyword lexicon can be built for each resource type. The keyword lexicons may include one or more of a keyword lexicon for a certain class of articles, a keyword lexicon for promotion-type advertisements, a keyword lexicon for pornographic advertisements, and the like.
For example, the keyword lexicon for promotion-type advertisements contains keywords such as: Part-time, Amalgamation-becoming duties, money paid for odd jobs, etc.
The keyword lexicon for pornographic advertisements contains keywords such as: ripe female, young married woman, it is plentiful, tempt, commit incest.
The segmented words of the preprocessed text resource data are matched against the keyword lexicon: if the segmented words include a keyword from the keyword lexicon, the preprocessed text resource data matches the keyword lexicon; otherwise, it does not.
The contact-information feature library may include a contact-information lexicon, or it may include a contact-information lexicon together with a consecutive-character count threshold. The contact-information lexicon is a pre-collected set of keywords that represent contact information. The contact-information lexicon may include: to browse, contact, scratch, button button, Qiu Qiu, wechat, common vetch letter, cocoa, net Stand, have mobile phone, have computer, support computer, support mobile phone, support Android, support android, see above Name, see above title, chatroom, space, button space, q photograph albums, buckle photograph album, word remove, kowtow, Prestige, V believe, add group, +skirt, customer service, V new, Pull Pull, Wei letter, consulting, understanding details, download ground Location, private chat, be letter, etc. The consecutive-character count threshold can be set as required, for example to 5, 6, or 7.
The segmented words of the preprocessed text resource data are matched against the contact-information feature library: if the preprocessed text resource data contains data from the contact-information feature library, the preprocessed text resource data matches the contact-information feature library; otherwise, it does not.
It should be noted that matching the preprocessed text resource data against the keyword lexicon and matching it against the contact-information feature library have no required order: the two can be performed in parallel, or one can be performed before the other.
In addition, the segmented words of the text resource data can be classified to obtain the type of each word, and each word is then matched against the keyword lexicon or the contact-information feature library according to its type. The type of a word may be the keyword type or the contact-information feature type.
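Token-level matching with a type per token might be sketched as follows. The source does not say how a token's type is determined; this sketch assumes, as one possible reading, that the type is decided by lexicon membership. The names `classify_tokens` and `is_target_tokens` and the lexicon entries are hypothetical.

```python
def classify_tokens(tokens, keyword_lexicon, contact_lexicon):
    """Group tokens by type: 'keyword' or 'contact'.

    Assumption: a token's type is determined by which lexicon contains it;
    tokens found in neither lexicon are ignored.
    """
    typed = {"keyword": [], "contact": []}
    for token in tokens:
        if token in keyword_lexicon:
            typed["keyword"].append(token)
        elif token in contact_lexicon:
            typed["contact"].append(token)
    return typed


def is_target_tokens(tokens, keyword_lexicon, contact_lexicon):
    """Target text needs at least one token of each type."""
    typed = classify_tokens(tokens, keyword_lexicon, contact_lexicon)
    return bool(typed["keyword"]) and bool(typed["contact"])


tokens = ["兼职", "日结", "加", "微信"]
print(classify_tokens(tokens, {"兼职"}, {"微信"}))
# {'keyword': ['兼职'], 'contact': ['微信']}
```

Matching whole tokens with set membership, instead of substrings, avoids hits that span word boundaries.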
Step 310: when the segmented words of the text resource data match the keyword lexicon and also match the contact-information feature library, the text resource data is target text resource data.
Specifically, when the segmented words of the preprocessed text resource data match both the keyword lexicon and the contact-information feature library, the text resource data is target text resource data. The target text resource data may be product data to be screened out, advertisement text resource data, or the like. If the target text resource data is advertisement text resource data, it can be filtered out.
In addition, when the target text resource data is advertisement text resource data, the method may further include, after step 310: filtering the advertisement text resource data.
In the above text resource data detection method, preprocessing the obtained text resource data removes interfering information and improves matching accuracy. The preprocessed text resource data is then matched against the keyword lexicon and against the contact-information feature library, and is treated as target text resource data only when it matches both. Combining keyword-lexicon matching with contact-information matching in this way allows target text resource data to be identified accurately; compared with keyword-lexicon matching alone, accuracy is higher and the misjudgment rate is lower. The method is also efficient, low-cost, fast, and flexible.
In one embodiment, the above text resource data detection method is applied in a browser or an application program, and the target text resource data is advertisement text resource data; the method then further includes: filtering the advertisement text resource data in the browser or application program.
Specifically, the browser may be any of various web browsers, and the application program may be an app that provides any of various services.
The above text resource data detection method can be applied to various advertisement publishing platforms to detect advertisements and to filter advertisement text after detection. It can also be applied to third-party websites to detect and filter advertisement text or pornographic information. The working principle of the text resource data detection method is described below using a specific application scenario. In this scenario, the text resource data is the comment information on a text resource data publishing platform, and the target text resource data is advertisement text resource data.
As shown in Fig. 4, multiple pieces of information are published on the resource data publishing platform. User A publishes: "The latest films now showing, Mission: Impossible 5, Terminator Genisys, etc., can be watched for only 5 yuan, prestige ys1111". User B publishes: "Beijing's bid for the Winter Olympic Games succeeded". User C publishes: "A chance to earn extra income, are you willing to seize it? Earning dozens of yuan a day is no problem, as long as you have the time and are willing to put in the effort. If interested, add q510100".
Using the above text resource data detection method, the pieces of information published on the text resource data publishing platform are obtained. The text resource data published by each user is matched against the advertisement keyword lexicon and then against the contact-information feature library. The text resource data published by user A and user C matches both the advertisement keyword lexicon and the contact-information feature library, so the text resource data published by user A and user C is advertisement text resource data, as shown in Fig. 5. Once detected, this advertisement text resource data can be filtered out, and the filtered-out text resource data is not displayed on the receiving terminal.
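The scenario of Fig. 4 and Fig. 5 can be replayed with a short sketch. The posts are abbreviated English renderings of the three comments, and both lexicons are hypothetical: the source does not state which lexicon entries actually triggered the matches for user A and user C.

```python
# Hypothetical lexicon entries chosen so that the example is self-contained.
KEYWORD_LEXICON = {"only 5 yuan", "extra income"}
CONTACT_LEXICON = {"prestige", "add q"}

posts = {
    "A": "The latest films can be watched for only 5 yuan, prestige ys1111",
    "B": "Beijing's bid for the Winter Olympic Games succeeded",
    "C": "A chance to earn extra income. If interested, add q510100",
}


def contains_any(text, terms):
    """Return True if any term occurs in the text."""
    return any(term in text for term in terms)


# A post is advertisement text only if it matches BOTH lexicons.
flagged = sorted(
    user
    for user, text in posts.items()
    if contains_any(text, KEYWORD_LEXICON) and contains_any(text, CONTACT_LEXICON)
)

# Filtering: only unflagged posts are shown on the receiving terminal.
visible = {user: text for user, text in posts.items() if user not in flagged}

print(flagged)           # ['A', 'C']
print(sorted(visible))   # ['B']
```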
Fig. 6 is a structural block diagram of a text resource data detection device in one embodiment. As shown in Fig. 6, a text resource data detection device includes an acquisition module 610, a matching module 620, and a screening module 630, wherein:
The acquisition module 610 is configured to obtain text resource data.
Specifically, the text resource data may be any text resource data obtained from a network, such as news data, comment data, or product data. Text resource data can be captured from massive network data by a resource crawling tool.
The matching module 620 is configured to match the text resource data against the keyword lexicon, and to match the text resource data against the contact-information feature library.
Specifically, a keyword lexicon is a pre-collected set of keywords that represent the characteristics of a certain class of resource. A corresponding keyword lexicon can be built for each resource type. The keyword lexicons may include one or more of a keyword lexicon for a certain class of articles, a keyword lexicon for promotion-type advertisements, a keyword lexicon for pornographic advertisements, and the like.
For example, the keyword lexicon for promotion-type advertisements contains keywords such as: part-time, Amalgamation-becoming duties, money paid for odd jobs, high salary, firewood provide, make money, earning money, day ties, runs a shop, cover treasured, Tao treasured, New-Light treasured, cry loudly treasured, grape treasured, Tao Bao, hide it is precious, Tao is precious, flood is precious, cover is precious, draw that precious, great waves are precious, it is precious to escape, draw guarantor, peach treasured, nine foldings, eight foldings, seven foldings, six foldings, five foldings, four fold, three foldings, eighty percent discount, a folding, shop, Taobao, pocket money, employ sincerely, business personnel, Alipay, in unlimited time, do not limit that place, time do not limit, place does not limit, taobao, net purchase, the free time, trick engage, base pay, cash pledge, advertisement, agency, it is part-time, and ear only, earn money, 1 folding, 2 foldings, 3 foldings, 4 foldings, 5 foldings, 6 foldings, 7 foldings, 8 foldings, 9 foldings, take in, say knot, Gong Capital, being in can do, hourly worker, have computer, wallet, full-time mother, wholesale, etc.
The keyword lexicon for pornographic advertisements contains keywords such as: ripe female, young married woman, plentiful, tempt, commit incest.
The text resource data is matched against the keyword lexicon. If the text resource data contains a keyword from the keyword lexicon, the text resource data matches the keyword lexicon; otherwise, it does not.
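The containment rule above amounts to a substring test against each lexicon entry. A minimal sketch follows, assuming a small hypothetical lexicon (`AD_KEYWORDS`) and a hypothetical helper name (`matches_keyword_lexicon`); neither comes from the patent.

```python
from typing import Set

# Hypothetical lexicon entries, stored in lowercase for illustration only.
AD_KEYWORDS: Set[str] = {"part-time", "extra income", "high salary", "wholesale"}


def matches_keyword_lexicon(text: str, lexicon: Set[str]) -> bool:
    """Return True if the text contains at least one keyword from the lexicon."""
    lowered = text.lower()  # entries are lowercase, so compare case-insensitively
    return any(keyword in lowered for keyword in lexicon)


print(matches_keyword_lexicon("A chance to earn extra income, add q510100", AD_KEYWORDS))    # True
print(matches_keyword_lexicon("Beijing's bid for the Winter Olympics succeeds", AD_KEYWORDS))  # False
```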
The contact information feature library may include a contact information lexicon, or may include a contact information lexicon and a consecutive character count threshold. The contact information lexicon is a set of keywords, collected in advance, that represent contact information. The contact information lexicon may include: browse, contact, scratch, button button, Qiu Qiu, wechat, common vetch letter, cocoa, website, have mobile phone, have computer, support computer, support mobile phone, support Android, support android, see above name, see above title, chatroom, space, button space, q photograph albums, buckle photograph album, word remove, kowtow, Prestige, V believe, add group, +skirt, customer service, V new, Pull Pull, Wei letter, consulting, understanding details, download address, private chat, be letter, etc. The consecutive character count threshold can be set as required, for example to 5, 6, or 7.
The text resource data is matched against the contact information feature library. If the text resource data contains data from the contact information feature library, the text resource data matches the contact information feature library; otherwise, it does not.
The screening module 630 is configured to determine that the text resource data is target text resource data when the text resource data matches the keyword lexicon and also matches the contact information feature library.
Specifically, when the text resource data matches both the keyword lexicon and the contact information feature library, it is target text resource data. The target text resource data may be item data to be screened out, advertisement text resource data, or the like. If the target text resource data is advertisement text resource data, it can be filtered.
The above text resource data detection device matches the obtained text resource data against the keyword lexicon and against the contact information feature library; when the text resource data matches both, it is target text resource data. Combining keyword lexicon matching with contact information matching identifies target text resource data accurately: compared with keyword lexicon matching alone, accuracy is higher and the false positive rate is lower.
In one embodiment, the keyword lexicon includes keyword lexicons of different resource types. The matching module 620 is further configured to match the text resource data against the keyword lexicon of each resource type separately; when the text resource data contains a keyword from the keyword lexicon of a given resource type, the text resource data matches the keyword lexicon of that resource type.
For example, the keyword lexicon includes a promotion-type advertisement lexicon and a pornographic advertisement lexicon. The text resource data is matched against the promotion-type advertisement lexicon; if it contains a keyword from that lexicon, it matches the promotion-type advertisement lexicon.
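Matching against one lexicon per resource type can be sketched as a dictionary keyed by type. The type names, the entries, and the function `matched_resource_types` below are hypothetical and serve only to illustrate the idea.

```python
from typing import Dict, List, Set

# Hypothetical per-type lexicons (lowercase entries); placeholders only.
LEXICONS_BY_TYPE: Dict[str, Set[str]] = {
    "promotion": {"part-time", "high salary", "wholesale"},
    "adult": {"tempting"},
}


def matched_resource_types(text: str, lexicons: Dict[str, Set[str]]) -> List[str]:
    """Return every resource type whose lexicon has a keyword in the text."""
    lowered = text.lower()
    return [
        resource_type
        for resource_type, keywords in lexicons.items()
        if any(keyword in lowered for keyword in keywords)
    ]


print(matched_resource_types("Part-time job with a high salary", LEXICONS_BY_TYPE))  # ['promotion']
print(matched_resource_types("Weather report for Tuesday", LEXICONS_BY_TYPE))        # []
```

Returning the list of matched types, rather than a single boolean, lets a caller treat each resource type differently.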
In one embodiment, the contact information feature library includes a contact information lexicon;
the matching module 620 is further configured to match the text resource data against the contact information lexicon.
Specifically, the text resource data is matched against the contact information lexicon. If the text resource data contains data from the contact information lexicon, the text resource data matches the contact information lexicon; otherwise, it does not.
In one embodiment, the contact information feature library includes a contact information lexicon and a consecutive character count threshold;
the matching module 620 is further configured to match the text resource data against the contact information lexicon, and to determine whether the text resource data contains a run of consecutive characters whose count is greater than or equal to the consecutive character count threshold;
when the text resource data matches the contact information lexicon and contains a run of consecutive characters whose count is greater than or equal to the consecutive character count threshold, the text resource data matches the contact information feature library; otherwise, the text resource data does not match the contact information feature library.
Because the contact information feature library includes both a contact information lexicon and a consecutive character count threshold, the text resource data must match the contact information lexicon and also contain a run of consecutive characters at least as long as the threshold. Requiring both conditions detects target text resource data more accurately.
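The two-part contact check can be sketched as follows. The patent does not say which characters count toward the run; this sketch assumes a run of ASCII letters and digits that contains at least one digit, which is how the account IDs `ys1111` and `q510100` in the Fig. 4 example look. The lexicon entries and the names `ID_RUN`, `longest_id_run`, and `matches_contact_features` are likewise hypothetical.

```python
import re
from typing import Optional, Set

# Hypothetical contact lexicon entries (lowercase), for illustration only.
CONTACT_WORDS: Set[str] = {"wechat", "prestige", "add q", "private chat"}

# Assumption: a "consecutive character" run is a run of ASCII letters and
# digits that contains at least one digit, e.g. an account ID.
ID_RUN = re.compile(r"[A-Za-z0-9]*\d[A-Za-z0-9]*")


def longest_id_run(text: str) -> int:
    """Length of the longest letter/digit run containing a digit (0 if none)."""
    return max((len(run) for run in ID_RUN.findall(text)), default=0)


def matches_contact_features(
    text: str, contact_words: Set[str], run_threshold: Optional[int] = None
) -> bool:
    """Match the contact lexicon and, if a threshold is given, the run length too."""
    has_contact_word = any(word in text.lower() for word in contact_words)
    if run_threshold is None:  # feature library holds only the lexicon
        return has_contact_word
    return has_contact_word and longest_id_run(text) >= run_threshold


print(matches_contact_features("Interested parties add q510100", CONTACT_WORDS, 5))  # True
print(matches_contact_features("Follow us on wechat", CONTACT_WORDS, 5))             # False
print(matches_contact_features("Follow us on wechat", CONTACT_WORDS))                # True
```

With the threshold enabled, a text that merely mentions a contact word but carries no ID-like run is not treated as matching.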
Fig. 7 is a structural block diagram of a text resource data detection device in another embodiment. As shown in Fig. 7, in addition to the acquisition module 610, the matching module 620, and the screening module 630, the text resource data detection device further includes a preprocessing module 640, a word segmentation module 650, and a filtering module 660, wherein:
The preprocessing module 640 is configured to preprocess the text resource data after it is obtained, yielding preprocessed text resource data.
Specifically, preprocessing removes interference such as emoticons and punctuation marks and retains text such as Chinese characters, digits, and letters, which yields the preprocessed text resource data.
Emoticons can be removed by directly replacing them with the empty string. After the emoticons are removed, pattern matching with a regular expression retains the Chinese characters, digits, letters, and the like, yielding the preprocessed text resource data.
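The two preprocessing steps can be sketched with `str.replace` and Python's `re` module. The emoticon codes, the sample strings, and the names `EMOTICON_CODES`, `NON_TEXT`, and `preprocess` are assumptions for illustration; the character class covers only the basic CJK Unified Ideographs block plus ASCII letters and digits, so whitespace is dropped along with punctuation.

```python
import re

# Hypothetical emoticon codes; a real platform would supply its own list.
EMOTICON_CODES = ["[smile]", "[rose]", "/:heart"]

# Anything that is not a CJK ideograph (U+4E00 to U+9FFF), an ASCII letter,
# or an ASCII digit. Whitespace and punctuation both fall into this class.
NON_TEXT = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]+")


def preprocess(text: str) -> str:
    """Strip emoticon codes, then keep only Chinese characters, letters, digits."""
    for code in EMOTICON_CODES:
        text = text.replace(code, "")  # replace each emoticon with the empty string
    return NON_TEXT.sub("", text)


print(preprocess("加q:510-100[smile]!!"))  # 加q510100
print(preprocess("Hello, world! [rose]"))  # Helloworld
```

Removing punctuation in this way also undoes simple obfuscation, such as an account number split by hyphens.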
The word segmentation module 650 is configured to perform word segmentation on the preprocessed text resource data to obtain the word segments of the text resource data. The matching module 620 is further configured to match the word segments of the text resource data against the keyword lexicon, and to match the word segments of the text resource data against the contact information feature library.
The filtering module 660 is configured to filter advertisement text resource data.
In addition, the word segmentation module 650 can classify the word segments of the text resource data to obtain the type of each segment. The matching module 620 is further configured to match each segment against the keyword lexicon or the contact information feature library according to its type. The type of a segment may be a keyword type or a contact information feature type.
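Segment-level matching can be sketched as below. Two simplifications are assumed: the regular-expression tokenizer `segment` is only a stand-in for a real word segmenter (Chinese text would need a dedicated one), and, since the patent does not say how a segment's type is decided, the type here is simply the library in which the segment is found. All names and lexicon entries are hypothetical.

```python
import re
from typing import Dict, List, Set

# Hypothetical single-token lexicon entries (lowercase).
AD_KEYWORDS: Set[str] = {"part-time", "wholesale", "salary"}
CONTACT_WORDS: Set[str] = {"wechat", "qq"}


def segment(text: str) -> List[str]:
    """Stand-in segmenter: lowercase runs of letters, digits, and hyphens."""
    return re.findall(r"[a-z0-9-]+", text.lower())


def classify_segments(tokens: List[str]) -> Dict[str, List[str]]:
    """Group segments by the library they hit: keyword type or contact type."""
    hits: Dict[str, List[str]] = {"keyword": [], "contact": []}
    for token in tokens:
        if token in AD_KEYWORDS:
            hits["keyword"].append(token)
        if token in CONTACT_WORDS:
            hits["contact"].append(token)
    return hits


def is_target(text: str) -> bool:
    """Target text needs at least one segment of each type."""
    hits = classify_segments(segment(text))
    return bool(hits["keyword"]) and bool(hits["contact"])


print(segment("Part-time work, contact wechat abc12345"))
print(is_target("Part-time work, wholesale prices, contact wechat abc12345"))  # True
print(is_target("The wholesale market reopened"))                              # False
```

Comparing whole segments rather than substrings avoids hits inside longer, unrelated words, at the cost of depending on the quality of the segmenter.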
When the above text resource data detection device is applied in a browser or an application and the target text resource data is advertisement text resource data, the filtering module 660 is further configured to filter the advertisement text resource data in the browser or application.
Specifically, the browser may be any web browser, and the application may be an app that provides any of various services.
The above text resource data detection device preprocesses the obtained text resource data, which removes interference and improves matching accuracy. It then matches the preprocessed text resource data against the keyword lexicon and against the contact information feature library; when the text resource data matches both, it is target text resource data. Combining keyword lexicon matching with contact information matching identifies target text resource data accurately: compared with keyword lexicon matching alone, accuracy is higher and the false positive rate is lower.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium, and when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or the like.
The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of the invention, all of which fall within the protection scope of the invention. Therefore, the protection scope of this patent shall be subject to the appended claims.

Claims (12)

1. A text resource data detection method, comprising the following steps:
obtaining text resource data;
matching the text resource data against a keyword lexicon, and matching the text resource data against a contact information feature library;
when the text resource data matches the keyword lexicon and the text resource data matches the contact information feature library, the text resource data is target text resource data.
2. The method according to claim 1, characterized in that the keyword lexicon includes keyword lexicons of different resource types;
the step of matching the text resource data against the keyword lexicon includes:
matching the text resource data against the keyword lexicon of each resource type separately;
when the text resource data contains a keyword from the keyword lexicon of a given resource type, the text resource data matches the keyword lexicon of that resource type.
3. The method according to claim 1, characterized in that the contact information feature library includes a contact information lexicon;
the step of matching the text resource data against the contact information feature library includes:
matching the text resource data against the contact information lexicon.
4. The method according to claim 1, characterized in that the contact information feature library includes a contact information lexicon and a consecutive character count threshold;
the step of matching the text resource data against the contact information feature library includes:
matching the text resource data against the contact information lexicon, and determining whether the text resource data contains a run of consecutive characters whose count is greater than or equal to the consecutive character count threshold;
when the text resource data matches the contact information lexicon and contains a run of consecutive characters whose count is greater than or equal to the consecutive character count threshold, the text resource data matches the contact information feature library; otherwise, the text resource data does not match the contact information feature library.
5. The method according to claim 1, characterized in that, after the step of obtaining text resource data, the method further includes:
preprocessing the text resource data to obtain preprocessed text resource data;
performing word segmentation on the preprocessed text resource data to obtain word segments of the text resource data;
the step of matching the text resource data against the keyword lexicon and matching the text resource data against the contact information feature library includes:
matching the word segments of the text resource data against the keyword lexicon, and matching the word segments of the text resource data against the contact information feature library.
6. The method according to any one of claims 1 to 5, characterized in that the method is applied in a browser or an application, and the target text resource data is advertisement text resource data;
the method further includes:
filtering the advertisement text resource data in the browser or application.
7. A text resource data detection device, characterized by including:
an acquisition module, configured to obtain text resource data;
a matching module, configured to match the text resource data against a keyword lexicon, and to match the text resource data against a preset contact information feature library;
a screening module, configured such that when the text resource data matches the keyword lexicon and the text resource data matches the contact information feature library, the text resource data is target text resource data.
8. The device according to claim 7, characterized in that the preset keyword lexicon includes keyword lexicons of different resource types;
the matching module is further configured to match the text resource data against the keyword lexicon of each resource type separately; when the text resource data contains a keyword from the keyword lexicon of a given resource type, the text resource data matches the keyword lexicon of that resource type.
9. The device according to claim 7, characterized in that the contact information feature library includes a contact information lexicon;
the matching module is further configured to match the text resource data against the contact information lexicon.
10. The device according to claim 7, characterized in that the contact information feature library includes a contact information lexicon and a consecutive character count threshold;
the matching module is further configured to match the text resource data against the contact information lexicon, and to determine whether the text resource data contains a run of consecutive characters whose count is greater than or equal to the consecutive character count threshold;
when the text resource data matches the contact information lexicon and contains a run of consecutive characters whose count is greater than or equal to the consecutive character count threshold, the text resource data matches the contact information feature library; otherwise, the text resource data does not match the contact information feature library.
11. The device according to claim 7, characterized in that the device further includes:
a preprocessing module, configured to preprocess the text resource data after it is obtained, yielding preprocessed text resource data;
a word segmentation module, configured to perform word segmentation on the preprocessed text resource data to obtain word segments of the text resource data;
the matching module is further configured to match the word segments of the text resource data against the keyword lexicon, and to match the word segments of the text resource data against the contact information feature library.
12. The device according to any one of claims 7 to 11, characterized in that the device is applied in a browser or an application, and the target text resource data is advertisement text resource data;
the device further includes:
a filtering module, configured to filter the advertisement text resource data in the browser or application.
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510859762.1A CN106815242A (en) 2015-11-30 2015-11-30 Textual resources data detection method and device

Publications (1)

Publication Number Publication Date
CN106815242A (en) 2017-06-09

Family

ID=59155794

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510859762.1A Pending CN106815242A (en) 2015-11-30 2015-11-30 Textual resources data detection method and device

Country Status (1)

Country Link
CN (1) CN106815242A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108399161A (en) * 2018-03-06 2018-08-14 平安科技(深圳)有限公司 Advertising pictures identification method, electronic device and readable storage medium storing program for executing

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102208992A (en) * 2010-06-13 2011-10-05 天津海量信息技术有限公司 Internet-facing filtration system of unhealthy information and method thereof
CN102567534A (en) * 2011-12-31 2012-07-11 凤凰在线(北京)信息技术有限公司 Interactive product user generated content intercepting system and intercepting method for the same
CN102572745A (en) * 2010-12-24 2012-07-11 中国移动通信集团上海有限公司 Method and device for determining waste short message
US20130173562A1 (en) * 2004-02-11 2013-07-04 Joshua Alspector Simplifying Lexicon Creation in Hybrid Duplicate Detection and Inductive Classifier System
CN104184653A (en) * 2014-07-28 2014-12-03 小米科技有限责任公司 Message filtering method and device
CN104462509A (en) * 2014-12-22 2015-03-25 北京奇虎科技有限公司 Review spam detection method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170609