CN106815242A - Text resource data detection method and device
- Publication number
- CN106815242A CN201510859762.1A
- Authority
- CN
- China
- Prior art keywords
- textual resources
- data
- resources data
- contact method
- matched
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
The present invention relates to a text resource data detection method and device. The method comprises the following steps: obtaining text resource data; matching the text resource data against a keyword lexicon and against a contact method feature library; and, when the text resource data matches both the keyword lexicon and the contact method feature library, determining that the text resource data is target text resource data. Because the obtained text resource data must match both the keyword lexicon and the contact method feature library before it is treated as target text resource data, target text resource data can be identified accurately; compared with keyword lexicon matching alone, accuracy is higher and the misjudgment rate is lower.
Description
Technical field
The present invention relates to the field of data detection, and more particularly to a text resource data detection method and device.
Background
The development of Internet technology has brought great convenience to people's lives, but the rapid growth in the number of web pages and in UGC (User Generated Content) has also brought the problem of information overload. Effectively selecting the required resource data from massive amounts of data, or filtering out unwanted resource data, is therefore very important for making browsing easier for users.
Take a topic-circle comment system built on a social network as an example. Resource data publishers from many places usually publish resource data on such a platform, so the platform needs to recognize this resource data quickly and decide whether to filter it out, in order to maintain a healthy browsing environment. Traditional resource data detection generally searches for keywords: resource data that matches a keyword lexicon is treated as the resource data to be detected. In some cases, however, the text in resource data is strongly context-dependent, and some words express completely different meanings under particular conditions, so detection by keyword matching alone has a high misjudgment rate.
Summary of the invention
In view of the high misjudgment rate of keyword matching in traditional resource data detection methods, it is necessary to provide a text resource data detection method that can reduce the misjudgment rate.
In addition, it is necessary to provide a text resource data detection device that can reduce the misjudgment rate.
A text resource data detection method comprises the following steps:
obtaining text resource data;
matching the text resource data against a keyword lexicon, and matching the text resource data against a contact method feature library; and
when the text resource data matches the keyword lexicon and the text resource data matches the contact method feature library, determining that the text resource data is target text resource data.
A text resource data detection device comprises:
an acquisition module, configured to obtain text resource data;
a matching module, configured to match the text resource data against a keyword lexicon, and to match the text resource data against a contact method feature library; and
a screening module, configured to determine that the text resource data is target text resource data when the text resource data matches the keyword lexicon and the text resource data matches the contact method feature library.
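As a rough illustration only, the three modules could be arranged as in the following Python sketch. The class names, the `acquire` callable, and the sample lexicon entries are hypothetical and do not come from the patent; substring containment is assumed as the matching rule.

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, Iterable, List


@dataclass
class MatchingModule:
    """Matches text against the two libraries (contents are hypothetical)."""
    keyword_lexicon: FrozenSet[str]
    contact_library: FrozenSet[str]

    def matches_keywords(self, text: str) -> bool:
        return any(word in text for word in self.keyword_lexicon)

    def matches_contact(self, text: str) -> bool:
        return any(word in text for word in self.contact_library)


@dataclass
class ScreeningModule:
    """Marks text as target data only when both matches succeed."""
    matcher: MatchingModule

    def is_target(self, text: str) -> bool:
        return (self.matcher.matches_keywords(text)
                and self.matcher.matches_contact(text))


@dataclass
class DetectionDevice:
    """Wires an acquisition callable to the screening module."""
    acquire: Callable[[], Iterable[str]]
    screening: ScreeningModule

    def run(self) -> List[str]:
        return [text for text in self.acquire() if self.screening.is_target(text)]


# Hypothetical sample data: 兼职 = part-time job, 微信 = WeChat.
matcher = MatchingModule(frozenset({"兼职"}), frozenset({"微信"}))
device = DetectionDevice(
    acquire=lambda: ["兼职日结，加微信", "我找到了一份兼职", "今天天气很好"],
    screening=ScreeningModule(matcher),
)
targets = device.run()
```

Only the first sample string contains both a keyword and a contact term, so it is the only one the screening module keeps.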
With the text resource data detection method and device above, the obtained text resource data is matched against the keyword lexicon and against the contact method feature library, and when the text resource data matches both, it is target text resource data. Combining keyword lexicon matching with contact method matching allows target text resource data to be identified accurately; compared with keyword lexicon matching alone, accuracy is higher and the misjudgment rate is lower.
Brief description of the drawings
Fig. 1A is a schematic diagram of the internal structure of a terminal in one embodiment;
Fig. 1B is a schematic diagram of the internal structure of a server in one embodiment;
Fig. 2 is a flowchart of the text resource data detection method in one embodiment;
Fig. 3 is a flowchart of the text resource data detection method in another embodiment;
Fig. 4 is a schematic diagram of information published on a text resource data publishing platform in one embodiment;
Fig. 5 is a schematic diagram of the advertisement text resource data detected from Fig. 4;
Fig. 6 is a structural block diagram of the text resource data detection device in one embodiment;
Fig. 7 is a structural block diagram of the text resource data detection device in another embodiment.
Detailed description
To make the purpose, technical solution, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are only used to explain the present invention and are not intended to limit it.
Fig. 1A is a schematic diagram of the internal structure of a terminal in one embodiment. As shown in Fig. 1A, the terminal includes a processor, a non-volatile storage medium, an internal memory, a network interface, a display screen, and an input device connected through a system bus. The non-volatile storage medium of the terminal stores an operating system and a text resource data detection device, and the text resource data detection device is used to implement a text resource data detection method. The terminal may also store the keyword lexicon, the contact method feature library, and so on. The processor provides computing and control capabilities, supports the operation of the entire terminal, and is configured to execute the text resource data detection method. The internal memory of the terminal provides an environment for running the text resource data detection device stored in the non-volatile storage medium. The network interface is used for network communication with a server, for example sending data access requests to the server and receiving data returned by the server. The display screen of the terminal may be a liquid crystal display, an electronic ink display, or the like. The input device may be a touch layer covering the display screen; a button, trackball, or trackpad provided on the terminal housing; or an external keyboard, trackpad, or mouse. The terminal may be a mobile phone, a tablet computer, a personal digital assistant, or the like. Those skilled in the art will understand that the structure shown in Fig. 1A is only a block diagram of the part of the structure related to the present solution and does not limit the terminal to which the present solution is applied; a specific terminal may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
Fig. 1B is a schematic diagram of the internal structure of a server in one embodiment. As shown in Fig. 1B, the server includes a processor, a storage medium, an internal memory, a display screen, an input device, and a network interface connected through a system bus. The non-volatile storage medium of the server stores an operating system, a database, and a text resource data detection device. The database stores the keyword lexicon, the contact method feature library, and so on, and the text resource data detection device is used to implement a text resource data detection method. The processor of the server provides computing and control capabilities, supports the operation of the entire server, and is configured to execute the text resource data detection method. The internal memory of the server provides an environment for running the text resource data detection device stored in the storage medium. The display screen of the server may be a liquid crystal display, an electronic ink display, or the like. The input device may be a touch layer covering the display screen; a button, trackball, or trackpad provided on the housing; or an external keyboard, trackpad, or mouse. The network interface of the server is used to communicate with an external terminal over a network connection, for example receiving data access requests sent by the terminal and returning data to the terminal. The server may be implemented as an independent server or as a server cluster composed of multiple servers. Those skilled in the art will understand that the structure shown in Fig. 1B is only a block diagram of the part of the structure related to the present solution and does not limit the server to which the present solution is applied; a specific server may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
It should be noted that the text resource data detection method provided by the present invention can be applied on a terminal or on a server.
Fig. 2 is a flowchart of the text resource data detection method in one embodiment. As shown in Fig. 2, a text resource data detection method comprises the following steps:
Step 202: obtain text resource data.
Specifically, the text resource data may be any text resource data obtained from the network, such as news data, comment data, or item data; it may be text resource data obtained by recognizing and converting any picture resource data obtained from the network; or it may be text resource data converted from any audio or video resource data obtained from the network. Resource data can be captured from massive network data by a resource crawling tool.
Step 204: match the text resource data against the keyword lexicon, and match the text resource data against the contact method feature library.
Specifically, the keyword lexicon is a pre-collected set of keywords that represent the features of a certain class of resource. A corresponding keyword lexicon can be built for each resource type. The keyword lexicon may include one or more of: a keyword lexicon for a certain class of items, a keyword lexicon for promotion-type advertisements, a keyword lexicon for pornographic advertisements, and so on.
For example, the keyword lexicon for promotion-type advertisements contains keywords such as: part-time job (including variant and split-character spellings), odd-job pay, high salary, wages (including a variant spelling), making money, earning money, daily settlement (including a look-alike spelling), opening a shop, Taobao (including taobao and numerous homophone and look-alike spellings), discounts from 10% off to 90% off written with either Chinese or Arabic numerals, shop, pocket money, hiring, salesperson, Alipay, no time limit, no location limit, online shopping, spare time, recruitment, base pay, deposit, advertisement, agency, income, can be done at home, hourly worker, has a computer, wallet, full-time mother, wholesale, and so on.
The keyword lexicon for pornographic advertisements contains keywords such as: mature woman, young married woman, plump, seduce, incest.
The text resource data is matched against the keyword lexicon: if the text resource data contains a keyword from the keyword lexicon, the text resource data matches the keyword lexicon; otherwise, the text resource data does not match the keyword lexicon.
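The containment rule just described can be sketched in a few lines of Python. This is a minimal sketch that assumes plain substring containment; the function name and the two sample keywords are illustrative and not taken from the patent.

```python
from typing import Iterable


def matches_keyword_lexicon(text: str, lexicon: Iterable[str]) -> bool:
    """Return True if any lexicon keyword occurs in the text as a substring."""
    return any(keyword in text for keyword in lexicon)


# Hypothetical entries: 兼职 = part-time job, 赚钱 = make money.
PROMOTION_LEXICON = {"兼职", "赚钱"}

hit = matches_keyword_lexicon("在家兼职，日结", PROMOTION_LEXICON)
miss = matches_keyword_lexicon("北京申办冬奥会成功", PROMOTION_LEXICON)
```

Here `hit` is true because the first string contains 兼职, while the second string contains no lexicon keyword.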
The contact method feature library may include a contact method lexicon, or may include a contact method lexicon and a consecutive character count threshold. The contact method lexicon is a pre-collected set of keywords that represent contact methods. The contact method lexicon may include: browse, contact, QQ (including slang and homophone spellings), WeChat (including homophone and abbreviated spellings), cocoa, website, has a mobile phone, has a computer, supports computer, supports mobile phone, supports Android, supports android, see the name above, see the title above, chat room, space, QQ space, q photo album, QQ photo album, remove the character, join group, +group (homophone spelling), customer service, Pull Pull, consulting, learn the details, download address, private chat, and so on. The consecutive character count threshold can be set as required, for example 5, 6, or 7.
The text resource data is matched against the contact method feature library: if the text resource data contains data from the contact method feature library, the text resource data matches the contact method feature library; if it does not, the text resource data does not match the contact method feature library.
It should be noted that matching the text resource data against the keyword lexicon and matching it against the contact method feature library have no required order: the two matches can be performed in parallel, or either one can be performed before the other.
Step 206: when the text resource data matches the keyword lexicon and the text resource data matches the contact method feature library, the text resource data is target text resource data.
Specifically, when the text resource data matches both the keyword lexicon and the contact method feature library, the text resource data is target text resource data. The target text resource data may be the item data to be screened, advertisement text resource data, and so on. If the target text resource data is advertisement text resource data, the advertisement text resource data can be filtered.
With the text resource data detection method above, the obtained text resource data is matched against the keyword lexicon and against the contact method feature library, and when the text resource data matches both, it is target text resource data. Combining keyword lexicon matching with contact method matching allows target text resource data to be identified accurately; compared with keyword lexicon matching alone, accuracy is higher and the misjudgment rate is lower.
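Steps 202 to 206 can be illustrated with a short Python sketch. It assumes substring containment for both matches; the function names, lexicon entries, and sample sentences are hypothetical.

```python
from typing import Iterable


def contains_any(text: str, entries: Iterable[str]) -> bool:
    """True if at least one entry occurs in the text."""
    return any(entry in text for entry in entries)


def is_target(text: str, keyword_lexicon: Iterable[str],
              contact_library: Iterable[str]) -> bool:
    """Target text must match BOTH the keyword lexicon and the contact library."""
    return contains_any(text, keyword_lexicon) and contains_any(text, contact_library)


# Hypothetical entries: 兼职 = part-time job, 赚钱 = make money,
# 微信 = WeChat, 客服 = customer service.
KEYWORDS = {"兼职", "赚钱"}
CONTACTS = {"微信", "客服"}

ordinary = "我找到了一份兼职"       # "I found a part-time job"
advert = "兼职赚钱，详情加微信"     # "Part-time work, make money, add WeChat for details"

keyword_only_flags_ordinary = contains_any(ordinary, KEYWORDS)
combined_flags_ordinary = is_target(ordinary, KEYWORDS, CONTACTS)
combined_flags_advert = is_target(advert, KEYWORDS, CONTACTS)
```

The ordinary sentence would be flagged by keyword matching alone but not by the combined check, which is the kind of misjudgment the second match is meant to avoid in this sketch.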
In one embodiment, the keyword lexicon includes keyword lexicons for different resource types.
The step of matching the text resource data against the keyword lexicon includes:
matching the text resource data against the keyword lexicon of each resource type;
if the text resource data contains a keyword from the keyword lexicon of a certain resource type, the resource data matches the keyword lexicon of that resource type.
For example, the keyword lexicon includes a promotion-type advertisement lexicon and a pornographic advertisement lexicon. The text resource data is matched against the promotion-type advertisement lexicon; if the text resource data contains a keyword from the promotion-type advertisement lexicon, the text resource data matches the promotion-type advertisement lexicon.
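One way to picture per-type lexicons is a mapping from resource type to keyword set, as in the sketch below. The type names, keywords, and function name are assumptions made for illustration.

```python
from typing import Dict, List, Set


def matched_resource_types(text: str, lexicons: Dict[str, Set[str]]) -> List[str]:
    """Return every resource type whose lexicon has a keyword in the text."""
    return [
        resource_type
        for resource_type, keywords in lexicons.items()
        if any(keyword in text for keyword in keywords)
    ]


# Hypothetical lexicons: 兼职 = part-time job, 赚钱 = make money, 诱惑 = seduce.
LEXICONS = {
    "promotion_ad": {"兼职", "赚钱"},
    "pornographic_ad": {"诱惑"},
}

promo_types = matched_resource_types("在家兼职赚钱", LEXICONS)
no_types = matched_resource_types("北京申办冬奥会成功", LEXICONS)
```

Returning the list of matched types, rather than a single boolean, lets a caller apply a different policy per resource type.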
In one embodiment, the contact method feature library includes a contact method lexicon. The step of matching the text resource data against the contact method feature library then includes: matching the text resource data against the contact method lexicon.
Specifically, the text resource data is matched against the contact method lexicon: if the text resource data contains data from the contact method lexicon, the text resource data matches the contact method lexicon; otherwise, the text resource data does not match the contact method lexicon.
In one embodiment, the contact method feature library includes a contact method lexicon and a consecutive character count threshold. The step of matching the text resource data against the contact method feature library then includes:
matching the text resource data against the contact method lexicon, and judging whether the text resource data contains a run of consecutive characters whose count is greater than or equal to the consecutive character count threshold;
when the text resource data matches the contact method lexicon and contains a run of consecutive characters whose count is greater than or equal to the threshold, the text resource data matches the contact method feature library; otherwise, the text resource data does not match the contact method feature library.
Specifically, the consecutive character count threshold can be set as required, for example 5, 6, or 7.
When the contact method feature library includes both a contact method lexicon and a consecutive character count threshold, the text resource data is matched against the contact method lexicon and is also checked for a run of consecutive characters whose count is greater than or equal to the threshold. Requiring both conditions allows target text resource data to be detected more accurately.
Fig. 3 is a flowchart of the text resource data detection method in another embodiment. As shown in Fig. 3, a text resource data detection method comprises the following steps:
Step 302: obtain text resource data.
Specifically, the text resource data may be any text resource data obtained from the network, such as news data, comment data, or item data; it may be text resource data obtained by recognizing and converting any picture resource data obtained from the network; or it may be text resource data converted from any audio or video resource data obtained from the network. Resource data can be captured from massive network data by a resource crawling tool.
Step 304: preprocess the text resource data to obtain preprocessed text resource data.
Specifically, the text resource data is preprocessed to remove interfering information such as emoticons and punctuation marks, keeping text information such as Chinese characters, digits, and letters; the result is the preprocessed text resource data. Emoticons can be removed directly by replacing them with an empty string. After the emoticons are removed, pattern matching with a regular expression keeps the Chinese characters, digits, and letters, yielding the preprocessed text resource data.
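A minimal preprocessing sketch in Python is shown below. It assumes emoticons appear as bracketed codes such as `[smile]` and that "Chinese" means the CJK Unified Ideographs range U+4E00 to U+9FFF; both are assumptions, as are the names used.

```python
import re

# Hypothetical emoticon codes; a real platform would supply its own list.
EMOTICON_CODES = ("[smile]", "[rose]")

# Keep runs of Chinese characters, ASCII letters, and digits.
KEEP_PATTERN = re.compile(r"[\u4e00-\u9fffA-Za-z0-9]+")


def preprocess(text: str) -> str:
    """Remove emoticon codes, then drop everything except Chinese, letters, digits."""
    for code in EMOTICON_CODES:
        text = text.replace(code, "")  # replace each emoticon with an empty string
    return "".join(KEEP_PATTERN.findall(text))


cleaned = preprocess("兼职[smile]，日结！加微信 ys1111")
```

Emoticons are removed before the regular expression runs because a code such as `[smile]` would otherwise leave the letters `smile` in the output.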
Step 306: perform word segmentation on the preprocessed text resource data to obtain the tokens of the text resource data.
Specifically, word segmentation is performed on the Chinese in the preprocessed text resource data to obtain separate tokens. For example, segmenting "I am Chinese, and my home is in Shenzhen" yields tokens such as "I", "Chinese", "home", and "Shenzhen".
Step 308: match the tokens of the text resource data against the keyword lexicon, and match the tokens of the text resource data against the contact method feature library.
Specifically, the keyword lexicon is a pre-collected set of keywords that represent the features of a certain class of resource. A corresponding keyword lexicon can be built for each resource type. The keyword lexicon may include one or more of: a keyword lexicon for a certain class of items, a keyword lexicon for promotion-type advertisements, a keyword lexicon for pornographic advertisements, and so on.
For example, the keyword lexicon for promotion-type advertisements contains keywords such as: part-time job (including a variant spelling), odd-job pay, and so on.
The keyword lexicon for pornographic advertisements contains keywords such as: mature woman, young married woman, plump, seduce, incest.
The tokens of the preprocessed text resource data are matched against the keyword lexicon: if the tokens include a keyword from the keyword lexicon, the preprocessed text resource data matches the keyword lexicon; otherwise, the preprocessed text resource data does not match the keyword lexicon.
The contact method feature library may include a contact method lexicon, or may include a contact method lexicon and a consecutive character count threshold. The contact method lexicon is a pre-collected set of keywords that represent contact methods. The contact method lexicon may include: browse, contact, QQ (including slang and homophone spellings), WeChat (including homophone and abbreviated spellings), cocoa, website, has a mobile phone, has a computer, supports computer, supports mobile phone, supports Android, supports android, see the name above, see the title above, chat room, space, QQ space, q photo album, QQ photo album, remove the character, join group, +group (homophone spelling), customer service, Pull Pull, consulting, learn the details, download address, private chat, and so on. The consecutive character count threshold can be set as required, for example 5, 6, or 7.
The tokens of the preprocessed text resource data are matched against the contact method feature library: if the preprocessed text resource data contains data from the contact method feature library, the preprocessed text resource data matches the contact method feature library; if it does not, the preprocessed text resource data does not match the contact method feature library.
It should be noted that matching the preprocessed text resource data against the keyword lexicon and matching it against the contact method feature library have no required order: the two matches can be performed in parallel, or either one can be performed before the other.
In addition, the tokens of the text resource data may be classified to obtain the type of each token, and each token is then matched against the keyword lexicon or the contact method feature library according to its type. The type of a token may be the keyword type or the contact method feature type.
Step 310: when the tokens of the text resource data match the keyword lexicon and the tokens of the text resource data match the contact method feature library, the text resource data is target text resource data.
Specifically, when the tokens of the preprocessed text resource data match both the keyword lexicon and the contact method feature library, the text resource data is target text resource data. The target text resource data may be the item data to be screened, advertisement text resource data, and so on. If the target text resource data is advertisement text resource data, the advertisement text resource data can be filtered.
In addition, when the target text resource data is advertisement text resource data, the method may further include, after step 310: filtering the advertisement text resource data.
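Steps 306 to 310 can be sketched in Python as below. The text does not name a segmentation algorithm, so forward maximum matching over a toy vocabulary is used here as a stand-in; the vocabulary, lexicon entries, and function names are hypothetical, and a production system would more likely use a dedicated segmentation library.

```python
from typing import Iterable, List, Set


def segment(text: str, vocabulary: Set[str], max_len: int = 3) -> List[str]:
    """Forward maximum matching: take the longest vocabulary word at each position.

    Characters not covered by the vocabulary become single-character tokens.
    """
    tokens: List[str] = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


def is_target(tokens: Iterable[str], keyword_lexicon: Set[str],
              contact_library: Set[str]) -> bool:
    """Target when some token is a keyword AND some token is a contact term."""
    token_set = set(tokens)
    return bool(token_set & keyword_lexicon) and bool(token_set & contact_library)


# Hypothetical data: 兼职 = part-time job, 日结 = daily settlement,
# 加 = add, 微信 = WeChat.
VOCABULARY = {"兼职", "日结", "加", "微信"}
tokens = segment("兼职日结加微信", VOCABULARY)
flagged = is_target(tokens, {"兼职"}, {"微信"})
```

Matching whole tokens rather than raw substrings means a lexicon entry only fires when the segmenter isolates it as a word, which is one plausible reason to segment before matching.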
With the text resource data detection method above, preprocessing the obtained text resource data removes interfering information and improves matching accuracy. The preprocessed text resource data is then matched against the keyword lexicon and against the contact method feature library, and when it matches both, the text resource data is target text resource data. Combining keyword lexicon matching with contact method matching allows target text resource data to be identified accurately; compared with keyword lexicon matching alone, accuracy is higher and the misjudgment rate is lower. The method is also efficient, low-cost, fast, and flexible.
In one embodiment, the text resource data detection method above is applied in a browser or an application, and the target text resource data is advertisement text resource data. The method then further includes: filtering the advertisement text resource data in the browser or application.
Specifically, the browser may be any of various web browsers, and the application may be an app that provides any of various services.
The text resource data detection method above can be applied to various advertisement publishing platforms for advertisement detection, with advertisement text filtered after detection; it can also be applied to third-party websites to detect and filter advertisement text or pornographic information. The working principle of the method is described below using a specific application scenario. In this scenario, the text resource data is comment information on a text resource data publishing platform, and the target text resource data is advertisement text resource data.
As shown in Fig. 4, multiple pieces of information are published on the resource data publishing platform. User A publishes "The newest films now showing, Mission: Impossible 5, Terminator Genisys, and so on, can be watched for only 5 yuan, Wei ys1111" (where "Wei" is a homophone shorthand for WeChat). User B publishes "Beijing's bid for the Winter Olympic Games succeeded". User C publishes "A chance to earn extra income, are you willing to seize it? Earning dozens of yuan a day is no problem, as long as you have the time and are willing to put in the effort. If interested, add q510100".
The text resource data detection method above obtains the multiple pieces of information published on the text resource data publishing platform, matches the text resource data published by each user against the advertisement keyword lexicon, and then matches the text resource data published by each user against the contact method feature library. The text resource data published by user A and user C matches both the advertisement keyword lexicon and the contact method feature library, so it is advertisement text resource data; as shown in Fig. 5, the text resource data published by user A and user C is detected as advertisement text resource data. Once detected, this text resource data can be filtered, and after filtering it is not displayed on the receiving terminal.
Fig. 6 is a structural block diagram of the text resource data detection device in one embodiment. As shown in Fig. 6, a text resource data detection device includes an acquisition module 610, a matching module 620, and a screening module 630, where:
The acquisition module 610 is configured to obtain text resource data.
Specifically, the text resource data may be any text resource data obtained from the network, such as news data, comment data, or item data. Text resource data can be captured from massive network data by a resource crawling tool.
The matching module 620 is configured to match the text resource data against the keyword lexicon, and to match the text resource data against the contact method feature library.
Specifically, the keyword lexicon is a pre-collected set of keywords that represent the features of a certain class of resource. A corresponding keyword lexicon can be built for each resource type. The keyword lexicon may include one or more of: a keyword lexicon for a certain class of items, a keyword lexicon for promotion-type advertisements, a keyword lexicon for pornographic advertisements, and so on.
For example, the keyword lexicon for promotion-type advertisements contains keywords such as: part-time job (including variant and split-character spellings), odd-job pay, high salary, wages (including a variant spelling), making money, earning money, daily settlement (including a look-alike spelling), opening a shop, Taobao (including taobao and numerous homophone and look-alike spellings), discounts from 10% off to 90% off written with either Chinese or Arabic numerals, shop, pocket money, hiring, salesperson, Alipay, no time limit, no location limit, online shopping, spare time, recruitment, base pay, deposit, advertisement, agency, income, can be done at home, hourly worker, has a computer, wallet, full-time mother, wholesale, and so on.
The keyword lexicon for pornographic advertisements contains keywords such as: Ripe female, young married woman, it is plentiful, tempt, commit incest.
The text resource data is matched against the keyword lexicon: if the text resource data contains a keyword from the keyword lexicon, the text resource data matches the keyword lexicon; otherwise, the text resource data does not match the keyword lexicon.
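The keyword lexicon matching described above can be sketched as a simple substring test. The following Python sketch is illustrative only: the function name `matches_keyword_lexicon` and the sample lexicon entries are assumptions made for this example, and the description does not prescribe a particular implementation.

```python
from collections.abc import Iterable

# Hypothetical sample lexicon; a real lexicon would be collected in advance.
PROMOTION_KEYWORDS = {"part-time", "high salary", "make money", "wholesale"}


def matches_keyword_lexicon(text: str, lexicon: Iterable[str]) -> bool:
    """Return True if the text contains at least one keyword from the lexicon."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in lexicon)


# Usage: the first text contains lexicon keywords, the second does not.
print(matches_keyword_lexicon("Part-time job with high salary", PROMOTION_KEYWORDS))
print(matches_keyword_lexicon("The weather is nice today", PROMOTION_KEYWORDS))
```

A plain substring test is the simplest reading of "contains a keyword"; matching on word segments instead would reduce accidental hits inside longer words.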
The contact method feature library may include a contact method lexicon, or may include a contact method lexicon and a consecutive character count threshold. The contact method lexicon is a pre-collected set of keywords that represent contact methods. The contact method lexicon may include: browse, contact, scratch, button button, Qiu Qiu, wechat, common vetch letter, cocoa, website, have mobile phone, have computer, support computer, support mobile phone, support Android, support android, see above name, see above title, chatroom, space, button space, q photograph albums, buckle photograph album, word remove, kowtow, Prestige, V believe, add group, +skirt, customer service, V new, Pull Pull, Wei letter, consulting, understanding details, download address, private chat, be letter, etc. The consecutive character count threshold can be set as needed, for example to 5, 6 or 7.
The text resource data is matched against the contact method feature library: if the text resource data contains data from the contact method feature library, the text resource data matches the contact method feature library; if it does not, the text resource data does not match the contact method feature library.
The screening module 630 is used to determine that the text resource data is target text resource data when the text resource data matches the keyword lexicon and the text resource data matches the contact method feature library.
Specifically, when the text resource data matches both the keyword lexicon and the contact method feature library, the text resource data is target text resource data. The target text resource data may be the item data or advertisement text resource data that needs to be screened out, and so on. If the target text resource data is advertisement text resource data, the advertisement text resource data may be filtered.
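The screening rule above is a logical AND of the two matching results. A minimal Python sketch follows; the names `contains_any` and `is_target_text` and the sample entries are hypothetical, chosen only to illustrate the rule.

```python
def contains_any(text: str, entries: set[str]) -> bool:
    """Return True if the text contains at least one of the given entries."""
    lowered = text.lower()
    return any(entry in lowered for entry in entries)


def is_target_text(text: str, keyword_lexicon: set[str], contact_entries: set[str]) -> bool:
    """A text is a target only if it matches BOTH libraries."""
    return contains_any(text, keyword_lexicon) and contains_any(text, contact_entries)


# Hypothetical sample data for illustration.
KEYWORDS = {"part-time", "make money"}
CONTACTS = {"wechat", "add group"}

for sample in (
    "Part-time work, contact me on WeChat",  # keyword and contact method
    "Looking for a part-time job",           # keyword only
    "My wechat is not working",              # contact method only
):
    print(sample, "->", is_target_text(sample, KEYWORDS, CONTACTS))
```

Requiring both matches is what distinguishes this scheme from keyword-only filtering: a text that merely mentions a keyword is not flagged.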
The above text resource data detection device matches the acquired text resource data against the keyword lexicon and against the contact method feature library; when the text resource data matches both the keyword lexicon and the contact method feature library, the text resource data is target text resource data. In this way, by combining keyword lexicon matching with contact method matching, target text resource data can be identified accurately; compared with keyword lexicon matching alone, accuracy is higher and the misjudgment rate is reduced.
In one embodiment, the keyword lexicon includes keyword lexicons of different resource types; the matching module 620 is further used to match the text resource data against the keyword lexicon of each resource type separately; when the text resource data contains a keyword from the keyword lexicon of a certain resource type, the text resource data matches the keyword lexicon of that resource type.
For example, the keyword lexicon includes a promotion-type advertisement lexicon and a pornographic advertisement lexicon. The text resource data is matched against the promotion-type advertisement lexicon; if the text resource data contains a keyword from the promotion-type advertisement lexicon, the text resource data matches the promotion-type advertisement lexicon.
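Matching against per-type lexicons can be sketched as a lookup over a mapping from resource type to keyword set. In the Python sketch below, the type names, the sample keywords and the function `matching_resource_types` are assumptions made for illustration.

```python
# Hypothetical lexicons keyed by resource type.
LEXICONS_BY_TYPE = {
    "promotion": {"part-time", "make money", "wholesale"},
    "pornographic": {"young married woman"},
}


def matching_resource_types(text: str, lexicons_by_type: dict[str, set[str]]) -> list[str]:
    """Return every resource type whose lexicon has a keyword in the text."""
    lowered = text.lower()
    return [
        resource_type
        for resource_type, lexicon in lexicons_by_type.items()
        if any(keyword in lowered for keyword in lexicon)
    ]


# Usage: only the promotion lexicon contains a keyword found in this text.
print(matching_resource_types("Make money from home", LEXICONS_BY_TYPE))
```

Returning the list of matched types, rather than a single boolean, lets a caller apply a different policy to each type.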
In one embodiment, the contact method feature library includes a contact method lexicon;
the matching module 620 is further used to match the text resource data against the contact method lexicon.
Specifically, the text resource data is matched against the contact method lexicon: if the text resource data contains data from the contact method lexicon, the text resource data matches the contact method lexicon; otherwise, the text resource data does not match the contact method lexicon.
In one embodiment, the contact method feature library includes a contact method lexicon and a consecutive character count threshold;
the matching module 620 is further used to match the text resource data against the contact method lexicon, and to judge whether the text resource data contains a run of consecutive characters whose length is greater than or equal to the consecutive character count threshold;
when the text resource data matches the contact method lexicon and contains a run of consecutive characters whose length is greater than or equal to the consecutive character count threshold, the text resource data matches the contact method feature library; otherwise, the text resource data does not match the contact method feature library.
Because the contact method feature library includes both a contact method lexicon and a consecutive character count threshold, the text resource data is matched against the contact method lexicon and is also checked for a run of consecutive characters whose length is greater than or equal to the threshold. Requiring both conditions allows target text resource data to be detected more accurately.
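The combined check can be sketched as follows. This passage does not state which characters count toward a consecutive run, so the Python sketch below assumes, purely for illustration, that a run means consecutive ASCII digits (as in a phone or account number); the function names and sample entries are likewise hypothetical.

```python
import re

# Assumption: a "consecutive character" run is a run of ASCII digits.
DIGIT_RUN = re.compile(r"[0-9]+")


def longest_digit_run(text: str) -> int:
    """Length of the longest run of consecutive digits, or 0 if there is none."""
    return max((len(run) for run in DIGIT_RUN.findall(text)), default=0)


def matches_contact_feature_library(
    text: str, contact_lexicon: set[str], threshold: int = 5
) -> bool:
    """Match only if a lexicon entry is present AND a long enough run exists."""
    lowered = text.lower()
    has_contact_word = any(entry in lowered for entry in contact_lexicon)
    return has_contact_word and longest_digit_run(text) >= threshold


CONTACT_LEXICON = {"wechat", "add group"}  # hypothetical entries

print(matches_contact_feature_library("add wechat 1234567", CONTACT_LEXICON))
print(matches_contact_feature_library("add wechat 12", CONTACT_LEXICON))
```

With the threshold set to 5, a contact word followed by a short number does not match, which filters out texts that mention a contact channel without giving an actual identifier.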
Fig. 7 is a structural block diagram of a text resource data detection device in another embodiment. As shown in Fig. 7, the text resource data detection device includes, in addition to the acquisition module 610, the matching module 620 and the screening module 630, a preprocessing module 640, a word segmentation module 650 and a filtering module 660. Wherein:
The preprocessing module 640 is used to preprocess the text resource data after the text resource data is obtained, to obtain preprocessed text resource data.
Specifically, the text resource data is preprocessed by removing interference information such as emoticons and punctuation marks and retaining text information such as Chinese characters, digits and letters, which yields the preprocessed text resource data.
Emoticons can be removed directly by replacing them with an empty string. After the emoticons are removed, pattern matching with a regular expression can be used to retain the Chinese characters, digits, letters and the like, which yields the preprocessed text resource data.
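The two preprocessing steps can be sketched with Python's `re` module. The emoticon encoding is not specified in this passage, so the sketch assumes, for illustration only, that emoticons appear as short bracketed codes such as `[smile]`; the character range `\u4e00-\u9fff` covers the common CJK unified ideographs.

```python
import re

# Assumption: emoticons are encoded as short bracketed codes like "[smile]".
EMOTICON_PATTERN = re.compile(r"\[[^\[\]]{1,10}\]")
# Anything that is not a common Chinese character, ASCII letter or ASCII digit.
NON_TEXT_PATTERN = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]")


def preprocess(text: str) -> str:
    """Remove emoticons, then keep only Chinese characters, letters and digits."""
    without_emoticons = EMOTICON_PATTERN.sub("", text)  # replace with empty string
    return NON_TEXT_PATTERN.sub("", without_emoticons)


print(preprocess("兼职[smile]，加微信: abc123!"))
```

Removing emoticons first matters under this encoding, since otherwise the letters inside a code such as `[smile]` would survive the second step.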
The word segmentation module 650 is used to perform word segmentation on the preprocessed text resource data to obtain the word segments of the text resource data. The matching module 620 is further used to match the word segments of the text resource data against the keyword lexicon, and to match the word segments of the text resource data against the contact method feature library.
The filtering module 660 is used to filter the advertisement text resource data.
In addition, the word segmentation module 650 may also classify the word segments of the text resource data to obtain the type of each word segment. The matching module 620 is further used to match each word segment against the keyword lexicon or the contact method feature library according to its type. The type of a word segment may be a keyword type or a contact method feature type.
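Word segmentation and segment classification can be sketched as below. The description does not name a segmentation algorithm, so this Python sketch substitutes a simple forward maximum-matching segmenter as a stand-in; the vocabulary, the lexicon entries and the function names are hypothetical.

```python
def segment(text: str, vocabulary: set[str], max_len: int = 4) -> list[str]:
    """Forward maximum matching: take the longest vocabulary word at each position,
    falling back to a single character when no word matches."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


def classify_tokens(
    tokens: list[str], keyword_lexicon: set[str], contact_lexicon: set[str]
) -> dict[str, str]:
    """Label each recognized token as a keyword type or a contact method type."""
    types = {}
    for token in tokens:
        if token in keyword_lexicon:
            types[token] = "keyword"
        elif token in contact_lexicon:
            types[token] = "contact"
    return types


KEYWORD_LEXICON = {"兼职"}  # hypothetical: "part-time"
CONTACT_LEXICON = {"微信"}  # hypothetical: "WeChat"

tokens = segment("兼职加微信", KEYWORD_LEXICON | CONTACT_LEXICON)
print(tokens)
print(classify_tokens(tokens, KEYWORD_LEXICON, CONTACT_LEXICON))
```

Matching on whole segments rather than raw substrings avoids hits that straddle word boundaries, at the cost of depending on segmentation quality.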
When the above text resource data detection device is applied in a browser or an application program and the target text resource data is advertisement text resource data, the filtering module 660 is further used to filter the advertisement text resource data in the browser or application program.
Specifically, the browser may be any of various web browsers. The application program may be an application (APP) that provides various services.
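Filtering inside an application can be sketched as dropping every detected advertisement before messages are displayed. The Python sketch below is a minimal illustration; the message format (sender, text), the sample lexicons and the function names are assumptions rather than part of the disclosure.

```python
def is_advertisement(text: str, keyword_lexicon: set[str], contact_entries: set[str]) -> bool:
    """Detect an advertisement: the text must match both libraries."""
    lowered = text.lower()
    has_keyword = any(keyword in lowered for keyword in keyword_lexicon)
    has_contact = any(entry in lowered for entry in contact_entries)
    return has_keyword and has_contact


def filter_messages(
    messages: list[tuple[str, str]],
    keyword_lexicon: set[str],
    contact_entries: set[str],
) -> list[tuple[str, str]]:
    """Keep only the (sender, text) pairs that are not detected as advertisements."""
    return [
        (sender, text)
        for sender, text in messages
        if not is_advertisement(text, keyword_lexicon, contact_entries)
    ]


# Hypothetical sample data.
KEYWORDS = {"part-time", "make money"}
CONTACTS = {"wechat"}
MESSAGES = [
    ("A", "Part-time job, add my wechat"),
    ("B", "See you tomorrow"),
    ("C", "Make money fast, join via WeChat"),
]

print(filter_messages(MESSAGES, KEYWORDS, CONTACTS))
```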
The above text resource data detection device preprocesses the acquired text resource data, which removes interference information and improves matching accuracy. It then matches the preprocessed text resource data against the keyword lexicon and against the contact method feature library; when the text resource data matches both the keyword lexicon and the contact method feature library, the text resource data is target text resource data. In this way, by combining keyword lexicon matching with contact method matching, target text resource data can be identified accurately; compared with keyword lexicon matching alone, accuracy is higher and the misjudgment rate is reduced.
A person of ordinary skill in the art will appreciate that all or part of the flows in the methods of the above embodiments can be completed by a computer program instructing the relevant hardware. The program may be stored in a non-volatile computer-readable storage medium, and when executed, the program may include the flows of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), or the like.
The embodiments described above express only several implementations of the present invention, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent. It should be noted that a person of ordinary skill in the art may make several variations and improvements without departing from the inventive concept, and these all fall within the protection scope of the present invention. Therefore, the protection scope of this patent shall be subject to the appended claims.
Claims (12)
1. A text resource data detection method, comprising the following steps:
obtaining text resource data;
matching the text resource data against a keyword lexicon, and matching the text resource data against a contact method feature library;
when the text resource data matches the keyword lexicon and the text resource data matches the contact method feature library, determining that the text resource data is target text resource data.
2. The method according to claim 1, characterized in that the keyword lexicon includes keyword lexicons of different resource types;
the step of matching the text resource data against the keyword lexicon includes:
matching the text resource data against the keyword lexicon of each resource type separately;
when the text resource data contains a keyword from the keyword lexicon of a certain resource type, the text resource data matches the keyword lexicon of that resource type.
3. The method according to claim 1, characterized in that the contact method feature library includes a contact method lexicon;
the step of matching the text resource data against the contact method feature library includes:
matching the text resource data against the contact method lexicon.
4. The method according to claim 1, characterized in that the contact method feature library includes a contact method lexicon and a consecutive character count threshold;
the step of matching the text resource data against the contact method feature library includes:
matching the text resource data against the contact method lexicon, and judging whether the text resource data contains a run of consecutive characters whose length is greater than or equal to the consecutive character count threshold;
when the text resource data matches the contact method lexicon and the text resource data contains a run of consecutive characters whose length is greater than or equal to the consecutive character count threshold, the text resource data matches the contact method feature library; otherwise, the text resource data does not match the contact method feature library.
5. The method according to claim 1, characterized in that, after the step of obtaining text resource data, the method further includes:
preprocessing the text resource data to obtain preprocessed text resource data;
performing word segmentation on the preprocessed text resource data to obtain the word segments of the text resource data;
the step of matching the text resource data against the keyword lexicon and matching the text resource data against the contact method feature library includes:
matching the word segments of the text resource data against the keyword lexicon, and matching the word segments of the text resource data against the contact method feature library.
6. The method according to any one of claims 1 to 5, characterized in that the method is applied in a browser or an application program, and the target text resource data is advertisement text resource data;
the method further includes:
filtering the advertisement text resource data in the browser or application program.
7. A text resource data detection device, characterized by including:
an acquisition module, used to obtain text resource data;
a matching module, used to match the text resource data against a keyword lexicon, and to match the text resource data against a preset contact method feature library;
a screening module, used to determine that the text resource data is target text resource data when the text resource data matches the keyword lexicon and the text resource data matches the contact method feature library.
8. The device according to claim 7, characterized in that the preset keyword lexicon includes keyword lexicons of different resource types;
the matching module is further used to match the text resource data against the keyword lexicon of each resource type separately; when the text resource data contains a keyword from the keyword lexicon of a certain resource type, the text resource data matches the keyword lexicon of that resource type.
9. The device according to claim 7, characterized in that the contact method feature library includes a contact method lexicon;
the matching module is further used to match the text resource data against the contact method lexicon.
10. The device according to claim 7, characterized in that the contact method feature library includes a contact method lexicon and a consecutive character count threshold;
the matching module is further used to match the text resource data against the contact method lexicon, and to judge whether the text resource data contains a run of consecutive characters whose length is greater than or equal to the consecutive character count threshold;
when the text resource data matches the contact method lexicon and the text resource data contains a run of consecutive characters whose length is greater than or equal to the consecutive character count threshold, the text resource data matches the contact method feature library; otherwise, the text resource data does not match the contact method feature library.
11. The device according to claim 7, characterized in that the device further includes:
a preprocessing module, used to preprocess the text resource data after the text resource data is obtained, to obtain preprocessed text resource data;
a word segmentation module, used to perform word segmentation on the preprocessed text resource data to obtain the word segments of the text resource data;
the matching module is further used to match the word segments of the text resource data against the keyword lexicon, and to match the word segments of the text resource data against the contact method feature library.
12. The device according to any one of claims 7 to 11, characterized in that the device is applied in a browser or an application program, and the target text resource data is advertisement text resource data;
the device further includes:
a filtering module, used to filter the advertisement text resource data in the browser or application program.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510859762.1A CN106815242A (en) | 2015-11-30 | 2015-11-30 | Textual resources data detection method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510859762.1A CN106815242A (en) | 2015-11-30 | 2015-11-30 | Textual resources data detection method and device |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106815242A (en) | 2017-06-09 |
Family
ID=59155794
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510859762.1A Pending CN106815242A (en) | 2015-11-30 | 2015-11-30 | Textual resources data detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106815242A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108399161A (en) * | 2018-03-06 | 2018-08-14 | 平安科技(深圳)有限公司 | Advertising pictures identification method, electronic device and readable storage medium storing program for executing |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102208992A (en) * | 2010-06-13 | 2011-10-05 | 天津海量信息技术有限公司 | Internet-facing filtration system of unhealthy information and method thereof |
CN102567534A (en) * | 2011-12-31 | 2012-07-11 | 凤凰在线(北京)信息技术有限公司 | Interactive product user generated content intercepting system and intercepting method for the same |
CN102572745A (en) * | 2010-12-24 | 2012-07-11 | 中国移动通信集团上海有限公司 | Method and device for determining waste short message |
US20130173562A1 (en) * | 2004-02-11 | 2013-07-04 | Joshua Alspector | Simplifying Lexicon Creation in Hybrid Duplicate Detection and Inductive Classifier System |
CN104184653A (en) * | 2014-07-28 | 2014-12-03 | 小米科技有限责任公司 | Message filtering method and device |
CN104462509A (en) * | 2014-12-22 | 2015-03-25 | 北京奇虎科技有限公司 | Review spam detection method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170609 |