Summary of the invention
In view of the above problems, the present invention is proposed in order to provide a method for determining the confidence level of interest point data in a website, and a corresponding device for determining the confidence level of interest point data in a website, which overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for determining the confidence level of interest point data in a website is provided, comprising:
extracting interest point data from web pages;
identifying erroneous second target interest point data from the interest point data;
counting a second quantity of the second target interest point data belonging to the same website;
determining the confidence level of the interest point data in the website according to the second quantity.
Optionally, the method further comprises:
when the confidence level is lower than a preset second threshold, prohibiting extraction of interest point data from the web pages of the website.
Optionally, the step of extracting interest point data from web pages comprises:
looking up a template configured for the web page;
extracting, in the web page, interest point data at the positions indicated by the template.
Optionally, the interest point data includes an interest point name;
the step of identifying erroneous second target interest point data from the interest point data comprises:
grouping the interest point names that identify the same object into an interest point name set;
identifying an erroneous second target interest point name from the interest point name set;
determining that the interest point data to which the second target interest point name belongs is erroneous second target interest point data.
Optionally, the interest point data includes an interest point address;
the step of grouping the interest point names that identify the same object into an interest point name set comprises:
judging whether the interest point addresses are the same or similar; if so, grouping the interest point names associated with those interest point addresses into an interest point name set.
Optionally, the step of identifying an erroneous second target interest point name from the interest point name set comprises:
selecting keywords from the interest point names in the interest point name set;
identifying the erroneous second target interest point name among the interest point names according to the keywords.
Optionally, the step of selecting keywords from the interest point names in the interest point name set comprises:
performing word segmentation on the interest point names in the interest point name set to obtain one or more segmented words;
looking up a first word frequency of each segmented word in a preset interest point set;
taking the X segmented words with the lowest first word frequency within the same interest point name as the keywords of that interest point name, where X is a positive integer.
Optionally, the step of selecting keywords from the interest point names in the interest point name set further comprises:
when a segmented word matches preset address data, removing that segmented word.
Optionally, the step of identifying the erroneous second target interest point name among the interest point names according to the keywords comprises:
calculating a second word frequency of each keyword in the interest point set;
taking the interest point names to which the Z keywords with the lowest second word frequency belong as the erroneous second target interest point names, where Z is a positive integer.
Optionally, the interest point data includes a URL;
the step of counting a second quantity of the second target interest point data belonging to the same website comprises:
looking up the URL corresponding to the second target interest point data;
when the URLs corresponding to the second target interest point data belong to the domain name of the same website, counting the second quantity of that second target interest point data.
Optionally, the step of determining the confidence level of the interest point data in the website according to the second quantity comprises:
calculating an error rate according to the second quantity;
determining the confidence level of the interest point data in the website according to the error rate.
According to another aspect of the present invention, a device for determining the confidence level of interest point data in a website is provided, comprising:
an interest point data extraction module, adapted to extract interest point data from web pages;
an erroneous interest point data identification module, adapted to identify erroneous second target interest point data from the interest point data;
an error quantity statistics module, adapted to count a second quantity of the second target interest point data belonging to the same website;
an untrusted confidence determination module, adapted to determine the confidence level of the interest point data in the website according to the second quantity.
Optionally, the device further comprises:
an extraction prohibition module, adapted to prohibit extraction of interest point data from the web pages of the website when the confidence level is lower than a preset second threshold.
Optionally, the interest point data extraction module is further adapted to:
look up a template configured for the web page;
extract, in the web page, interest point data at the positions indicated by the template.
Optionally, the interest point data includes an interest point name;
the erroneous interest point data identification module is further adapted to:
group the interest point names that identify the same object into an interest point name set;
identify an erroneous second target interest point name from the interest point name set;
determine that the interest point data to which the second target interest point name belongs is erroneous second target interest point data.
Optionally, the interest point data includes an interest point address;
the erroneous interest point data identification module is further adapted to:
judge whether the interest point addresses are the same or similar; if so, group the interest point names associated with those interest point addresses into an interest point name set.
Optionally, the erroneous interest point data identification module is further adapted to:
select keywords from the interest point names in the interest point name set;
identify the erroneous second target interest point name among the interest point names according to the keywords.
Optionally, the erroneous interest point data identification module is further adapted to:
perform word segmentation on the interest point names in the interest point name set to obtain one or more segmented words;
look up a first word frequency of each segmented word in a preset interest point set;
take the X segmented words with the lowest first word frequency within the same interest point name as the keywords of that interest point name, where X is a positive integer.
Optionally, the erroneous interest point data identification module is further adapted to:
remove a segmented word when it matches preset address data.
Optionally, the erroneous interest point data identification module is further adapted to:
calculate a second word frequency of each keyword in the interest point set;
take the interest point names to which the Z keywords with the lowest second word frequency belong as the erroneous second target interest point names, where Z is a positive integer.
Optionally, the interest point data includes a URL;
the error quantity statistics module is further adapted to:
look up the URL corresponding to the second target interest point data;
count the second quantity of the second target interest point data when the URLs corresponding to that second target interest point data belong to the domain name of the same website.
Optionally, the untrusted confidence determination module is further adapted to:
calculate an error rate according to the second quantity;
determine the confidence level of the interest point data in the website according to the error rate.
The embodiment of the present invention identifies erroneous second target interest point data from the interest point data extracted from web pages, counts the second quantity of the second target interest point data belonging to the same website, and determines the confidence level of the interest point data in the website, so that this erroneous POI (point of interest) data can be rejected in subsequent operations, reducing the error rate of those operations and the waste of resources.
Furthermore, crawling POI data from untrusted POI data sources is prohibited according to the confidence level, so that the POI data that is crawled has high correctness, which reduces the waste of the computer's system resources and bandwidth resources and improves the efficiency of POI data crawling.
The above description is only an overview of the technical solution of the present invention. In order that the technical means of the present invention may be understood more clearly and implemented in accordance with the contents of the specification, and in order to make the above and other objects, features and advantages of the present invention more apparent, specific embodiments of the present invention are set forth below.
Specific embodiment
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Referring to Fig. 1, a flow chart of the steps of Embodiment 1 of a method for determining the confidence level of interest point data in a website according to an embodiment of the present invention is shown. The method may specifically include the following steps:
Step 101, extracting interest point data from web pages;
In the embodiment of the present invention, a crawler may crawl web pages on the Internet in advance by following the links between web pages and save them; the crawled web pages are stored in a web page database and form a large pool of search resources.
For web pages that contain a large amount of POI data distributed in a regular pattern, such as pages of websites where users review restaurants or tourist attractions and pages of map websites, the template configured for the web page can be looked up, and interest point data can be extracted in the web page at the positions indicated by the template. A large amount of POI data is obtained in this way, including the associated interest point name, interest point address, URL (Uniform Resource Locator) and so on.
For example, part of the web page structure of a certain website is as follows:
Where "***" is the domain name.
In the template for this website, the interest point name can be extracted from the first line and the interest point address from the last line.
Using templates, the following interest point data is extracted from the web pages of different websites:
Where "***A" and "***B" are different domain names.
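Template-driven extraction as described for step 101 can be sketched as follows. This is a minimal illustration, assuming a hypothetical `Template` structure that records which line of a page fragment holds each field (the first line for the name, the last line for the address). The fragment contents and the URL are invented placeholders, not data from the embodiment.

```python
from dataclasses import dataclass


@dataclass
class Template:
    """Hypothetical template: line index of each field within a page fragment."""
    name_line: int      # 0 means the first line
    address_line: int   # -1 means the last line


def extract_poi(lines, url, template):
    """Build one POI record from the lines of a page fragment."""
    return {
        "name": lines[template.name_line].strip(),
        "address": lines[template.address_line].strip(),
        "url": url,
    }


# Placeholder fragment and URL (assumptions for illustration only).
fragment = ["  Example Shop  ", "Rating: 4.5", "  1 Example Road  "]
poi = extract_poi(fragment, "http://site-a.example/shop/1", Template(0, -1))
```

A real template would more likely address positions in the page markup than raw line numbers; line indices are used here only to mirror the "first line / last line" wording above.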
Step 102, identifying correct first target interest point data from the interest point data;
In the embodiment of the present invention, correct first target interest point data refers to data that conforms to the interest point specification, including a correct name, address and so on.
In an optional embodiment of the present invention, step 102 may include the following sub-steps:
Sub-step S11, grouping the interest point names that identify the same object into an interest point name set;
POI data generally identifies an object, such as a house, a shop, a mailbox or a bus stop.
Since the address information of an object is generally relatively accurate, in the embodiment of the present invention the interest point addresses can be normalized in order to judge whether they are the same or similar; if so, the interest point names associated with those interest point addresses are grouped into an interest point name set.
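The grouping in sub-step S11 can be sketched as follows. The normalization rule here (lowercasing, writing "three" as a digit, dropping punctuation and spaces) is a toy assumption, since the source does not specify how addresses are normalized. It only detects addresses that become identical after normalization, not merely similar ones, and the sample records are invented.

```python
import re
from collections import defaultdict


def normalize_address(address):
    """Toy normalization (assumed): lowercase, 'three' -> '3', keep only letters and digits."""
    text = address.lower().replace("three", "3")
    return re.sub(r"[^a-z0-9]", "", text)


def group_names_by_address(pois):
    """Map each normalized address to the list of POI names found at it."""
    groups = defaultdict(list)
    for poi in pois:
        groups[normalize_address(poi["address"])].append(poi["name"])
    return dict(groups)


# Invented sample records, for illustration only.
pois = [
    {"name": "Shop A", "address": "Floor Three, East Department Store"},
    {"name": "Shop A Branch", "address": "floor 3, east department store"},
    {"name": "Shop B", "address": "12 North Road"},
]
name_sets = group_names_by_address(pois)
```

Each value in `name_sets` plays the role of one interest point name set.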
For example, the four interest point addresses "three building, the permanent general merchandise in Yulin road Yu Yangfushi tide today hotel next door east", "Yulin Yuyang District skin Shi Lujin Diurnal tide next door east three buildings the first sales departments of permanent general merchandise", "3 building, the permanent department store in Yulin south gate Yu Yang mouthful east" and "Yulin south Wholesale three buildings of the permanent general merchandise in doorway east" are not identical in form, but normalization can determine that they all refer to the same address, "three building, the permanent department store in rate in Yuyang county east".
That is, the associated interest point names "500 tops of the world enterprise", "China Ping'an Insurance company", "Chinese safety Yulin branch company" and "Yulin branch company, China Ping'an Insurance Co., Ltd. Branch" form one interest point name set.
Sub-step S12, identifying a correct first target interest point name from the interest point name set;
In the embodiment of the present invention, the correct POI name, i.e. the first target interest point name, can be screened out by mining the keywords of the interest point names.
In an optional embodiment of the present invention, sub-step S12 may further include the following sub-steps:
Sub-step S121, selecting keywords from the interest point names in the interest point name set;
In the embodiment of the present invention, a keyword may be the word that carries the most information and reflects the characteristics of the interest point name.
In a specific implementation, word segmentation can be performed on the interest point names in the interest point name set to obtain one or more segmented words;
The first word frequency of each segmented word in a preset interest point set is then looked up. The interest point set is the set of POI data in the crawled web pages, which may contain tens of millions of POI records; the first word frequency is counted over the names of those tens of millions of POI records.
In the embodiment of the present invention, one or more of the following word segmentation methods can be used:
1. Segmentation based on string matching: the Chinese character string to be analyzed is matched against the entries of a preset machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is identified).
2. Segmentation based on feature scanning or mark cutting: words with obvious features are identified and cut out of the string to be analyzed first; using these words as breakpoints, the original string is divided into smaller strings that are then passed to mechanical segmentation, which reduces the matching error rate. Alternatively, segmentation is combined with part-of-speech tagging: rich part-of-speech information assists the segmentation decisions, and the segmentation result is in turn checked and adjusted during tagging, which improves the segmentation accuracy.
3. Segmentation based on understanding: the computer simulates a human's understanding of the sentence in order to identify words. The basic idea is to perform syntactic and semantic analysis during segmentation and to use syntactic and semantic information to resolve ambiguity. Such a system generally includes three parts: a segmentation subsystem, a syntactic-semantic subsystem and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences in order to resolve segmentation ambiguity; that is, it simulates the human process of understanding a sentence. This segmentation method requires a large amount of linguistic knowledge and information.
4. Segmentation based on statistics: in Chinese text, the frequency or probability with which characters co-occur adjacently reflects well how likely they are to form a word. The frequency of each combination of adjacently co-occurring characters in a corpus can therefore be counted and their mutual information calculated, together with the adjacent co-occurrence probability of two Chinese characters X and Y. The mutual information reflects how tightly the characters are bound together. When the tightness is higher than a certain threshold, the character group can be considered likely to constitute a word.
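As an illustration of the first method (segmentation based on string matching), the sketch below uses forward maximum matching, one common matching strategy; the source does not say which strategy is used. The dictionary and the Chinese input string are assumptions: the string is an assumed rendering of one of the example names, not text taken from the source.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text by greedily taking the longest dictionary entry from the left."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is accepted even when it is not in the dictionary.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Assumed dictionary and input, for illustration only.
dictionary = {"中国", "平安", "保险", "公司"}
segments = forward_max_match("中国平安保险公司", dictionary)
```

Greedy matching is simple but can mis-segment ambiguous strings, which is the kind of error the other three methods try to reduce.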
For example, the above interest point names can be segmented as follows:
The lower the first word frequency of a segmented word, the more information it generally carries. The X segmented words with the lowest first word frequency within the same interest point name can therefore be taken as the keywords of that interest point name, where X is a positive integer.
For example, the following keywords can be extracted from the above interest point names:
Interest point name | Keyword
500 tops of the world enterprise | The world
China Ping'an Insurance company | Safety
Chinese safety Yulin branch company | Safety
Yulin branch company, China Ping'an Insurance Co., Ltd. Branch | Safety
Here, words such as "enterprise", "company" and "branch company" have a high first word frequency and carry little information: they only indicate that the entity is an enterprise or company and do not point to anything specific, so they are not suitable as keywords. Words such as "safety", which is a commonly used short name of an enterprise, have a lower first word frequency and carry more information, so they are suitable as keywords.
It should be noted that address data such as the provinces, cities, counties (districts), townships and roads of the whole country can be obtained in advance to create an address database.
When a segmented word matches the preset address data, for example "China" or "Yulin", it is an invalid keyword and can be removed.
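Keyword selection in sub-step S121 can be sketched as follows: address words are dropped, and the X segmented words with the lowest first word frequency are kept. The frequency values are invented for illustration, and treating an unseen word as frequency 0 is an assumption of this sketch.

```python
def choose_keywords(segments, first_word_freq, address_words, x=1):
    """Return the x segmented words with the lowest corpus frequency, skipping address words."""
    candidates = [s for s in segments if s not in address_words]
    # Words missing from the frequency table count as frequency 0 (assumed behaviour).
    candidates.sort(key=lambda s: first_word_freq.get(s, 0))
    return candidates[:x]


# Invented frequencies; real values would be counted over the whole interest point set.
first_word_freq = {
    "China": 900000,
    "safety": 1200,
    "Yulin": 30000,
    "branch company": 500000,
}
address_words = {"China", "Yulin"}
keywords = choose_keywords(
    ["China", "safety", "Yulin", "branch company"], first_word_freq, address_words
)
```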
Sub-step S122, identifying the correct first target interest point name among the interest point names according to the keywords.
In a specific implementation, the second word frequency of each keyword in the interest point name set can be calculated, and the interest point names to which the Y keywords with the highest second word frequency belong are determined to be correct first target interest point names, where Y is a positive integer.
For example, among the keywords of the above interest point names, the second word frequency of "the world" is 1 and that of "safety" is 3. Since the second word frequency of "safety" is higher, the interest point names to which it belongs, "China Ping'an Insurance company", "Chinese safety Yulin branch company" and "Yulin branch company, China Ping'an Insurance Co., Ltd. Branch", can be confirmed as correct first target interest point names.
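The selection in sub-step S122 can be sketched with the keyword table above. This is a minimal illustration that assumes one keyword per name (X = 1) and counts the second word frequency within the interest point name set; the function name is hypothetical.

```python
from collections import Counter


def correct_names(name_to_keyword, y=1):
    """Return the names whose keyword is among the y most frequent keywords in the set."""
    second_freq = Counter(name_to_keyword.values())
    top = {keyword for keyword, _ in second_freq.most_common(y)}
    return [name for name, keyword in name_to_keyword.items() if keyword in top]


# Names and keywords as listed in the example table.
name_to_keyword = {
    "500 tops of the world enterprise": "The world",
    "China Ping'an Insurance company": "Safety",
    "Chinese safety Yulin branch company": "Safety",
    "Yulin branch company, China Ping'an Insurance Co., Ltd. Branch": "Safety",
}
names = correct_names(name_to_keyword)
```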
Sub-step S13, determining that the interest point data to which the first target interest point name belongs is correct first target interest point data.
When the name of a POI is correct, the POI can be confirmed to be a correct POI.
Step 103, counting a first quantity of the first target interest point data belonging to the same website;
In practical applications, the URL corresponding to each piece of first target interest point data can be looked up; when the URLs corresponding to first target interest point data belong to the domain name of the same website, the first quantity of that first target interest point data is counted.
For example, in the above example of interest point data, the URLs of "500 tops of the world enterprise", "China Ping'an Insurance company" and "Chinese safety Yulin branch company" belong to the domain name "***A" of the same website, i.e. these interest point names belong to the same website. Two of them are correct first target interest point names, so the first quantity of first target interest point data for this website is 2.
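Counting per website in step 103 can be sketched with the standard library's `urllib.parse.urlparse`. The URLs below are placeholders, because the source elides the real domains as "***A" and "***B". Assigning the fourth name to the second site is also an assumption, and using the host name as the "domain name" is a simplification.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_site(target_pois):
    """Count target POI records per host name taken from each record's URL."""
    return Counter(urlparse(poi["url"]).netloc for poi in target_pois)


# Placeholder URLs standing in for the elided domains "***A" and "***B".
first_targets = [
    {"name": "China Ping'an Insurance company",
     "url": "http://site-a.example/p/1"},
    {"name": "Chinese safety Yulin branch company",
     "url": "http://site-a.example/p/2"},
    {"name": "Yulin branch company, China Ping'an Insurance Co., Ltd. Branch",
     "url": "http://site-b.example/p/9"},
]
first_quantity = count_by_site(first_targets)
```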
Step 104, determining the confidence level of the interest point data in the website according to the first quantity.
In a specific implementation, the accuracy rate, i.e. the ratio of the first quantity to the total quantity, can be calculated from the first quantity; for the above website with domain name "***A", the accuracy rate is 2/3, i.e. 66.67%.
The confidence level of the interest point data in the website is determined according to the accuracy rate; in this case the confidence level characterizes trustworthiness.
In one example, the accuracy rate can be assigned directly to the confidence level;
In another example, weights that decay over time can be configured for the accuracy rates of different time periods, and the weighted accuracy rates are combined into the confidence level, for example by summation.
Of course, the above ways of calculating the confidence level are only examples. When implementing the embodiment of the present invention, other ways of calculating the confidence level can be set according to the actual situation, or adopted by those skilled in the art according to actual needs; the embodiment of the present invention places no restriction on this.
When the confidence level is higher than a preset first threshold, the POI source of the website is considered trustworthy, and extraction of interest point data from the web pages of the website is allowed.
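The two example calculations in step 104 can be sketched as follows. The exponential half-life weighting, the normalization into a weighted average, the second period's accuracy of 0.9 and the threshold of 0.6 are all assumptions for illustration; the source only says that the weights decay over time and that the weighted rates are combined, for example by summation.

```python
def accuracy(first_quantity, total_quantity):
    """Ratio of correct POI records to all POI records of a website."""
    return first_quantity / total_quantity if total_quantity else 0.0


def decayed_confidence(rates_by_age, half_life=30.0):
    """Weighted average of (age_in_days, rate) pairs; the weight halves every half_life days."""
    weights = [0.5 ** (age / half_life) for age, _ in rates_by_age]
    total = sum(weights)
    if not total:
        return 0.0
    return sum(w * rate for w, (_, rate) in zip(weights, rates_by_age)) / total


site_accuracy = accuracy(2, 3)  # the "***A" example: 2 correct out of 3
confidence = decayed_confidence([(0, site_accuracy), (30, 0.9)])  # 0.9 is invented
allowed = confidence > 0.6  # 0.6 stands in for the preset first threshold
```

Normalizing by the weight total keeps the confidence level on the same 0 to 1 scale as the accuracy rate, so a single threshold can be applied however many periods are available.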
The embodiment of the present invention identifies correct first target interest point data from the interest point data extracted from web pages, counts the first quantity of the first target interest point data belonging to the same website, and determines the confidence level of the interest point data in the website, so that this correct POI data can be used in subsequent operations, reducing the error rate of those operations and the waste of resources.
Furthermore, crawling POI data from trusted POI data sources is allowed according to the confidence level, so that the POI data that is crawled has high correctness, which reduces the waste of the computer's system resources and bandwidth resources and improves the efficiency of POI data crawling.
Referring to Fig. 2, a flow chart of the steps of Embodiment 2 of a method for determining the confidence level of interest point data in a website according to an embodiment of the present invention is shown. The method may specifically include the following steps:
Step 201, extracting interest point data from web pages;
In the embodiment of the present invention, a crawler may crawl web pages on the Internet in advance by following the links between web pages and save them; the crawled web pages are stored in a web page database and form a large pool of search resources.
For web pages that contain a large amount of POI data distributed in a regular pattern, such as pages of websites where users review restaurants or tourist attractions and pages of map websites, the template configured for the web page can be looked up, and interest point data can be extracted in the web page at the positions indicated by the template. A large amount of POI data is obtained in this way, including the associated interest point name, interest point address, URL (Uniform Resource Locator) and so on.
For example, part of the web page structure of a certain website is as follows:
Where "***" is the domain name.
In the template for this website, the interest point name can be extracted from the first line and the interest point address from the last line.
Using templates, the following interest point data is extracted from the web pages of different websites:
Where "***A" and "***B" are different domain names.
Step 202, identifying erroneous second target interest point data from the interest point data;
In the embodiment of the present invention, erroneous second target interest point data refers to data that does not conform to the interest point specification, including an erroneous name, address and so on.
In an optional embodiment of the present invention, step 202 may include the following sub-steps:
Sub-step S21, grouping the interest point names that identify the same object into an interest point name set;
POI data generally identifies an object, such as a house, a shop, a mailbox or a bus stop.
Since the address information of an object is generally relatively accurate, in the embodiment of the present invention the interest point addresses can be normalized in order to judge whether they are the same or similar; if so, the interest point names associated with those interest point addresses are grouped into an interest point name set.
For example, the four interest point addresses "three building, the permanent general merchandise in Yulin road Yu Yangfushi tide today hotel next door east", "Yulin Yuyang District skin Shi Lujin Diurnal tide next door east three buildings the first sales departments of permanent general merchandise", "3 building, the permanent department store in Yulin south gate Yu Yang mouthful east" and "Yulin south Wholesale three buildings of the permanent general merchandise in doorway east" are not identical in form, but normalization can determine that they all refer to the same address, "three building, the permanent department store in rate in Yuyang county east".
That is, the associated interest point names "500 tops of the world enterprise", "China Ping'an Insurance company", "Chinese safety Yulin branch company" and "Yulin branch company, China Ping'an Insurance Co., Ltd. Branch" form one interest point name set.
Sub-step S22, identifying an erroneous second target interest point name from the interest point name set;
In the embodiment of the present invention, the erroneous POI name, i.e. the second target interest point name, can be screened out by mining the keywords of the interest point names.
In an optional embodiment of the present invention, sub-step S22 may further include the following sub-steps:
Sub-step S221, selecting keywords from the interest point names in the interest point name set;
In the embodiment of the present invention, a keyword may be the word that carries the most information and reflects the characteristics of the interest point name.
In a specific implementation, word segmentation can be performed on the interest point names in the interest point name set to obtain one or more segmented words;
The first word frequency of each segmented word in a preset interest point set is then looked up. The interest point set is the set of POI data in the crawled web pages, which may contain tens of millions of POI records; the first word frequency is counted over the names of those tens of millions of POI records.
In the embodiment of the present invention, one or more of the following word segmentation methods can be used:
1. Segmentation based on string matching: the Chinese character string to be analyzed is matched against the entries of a preset machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is identified).
2. Segmentation based on feature scanning or mark cutting: words with obvious features are identified and cut out of the string to be analyzed first; using these words as breakpoints, the original string is divided into smaller strings that are then passed to mechanical segmentation, which reduces the matching error rate. Alternatively, segmentation is combined with part-of-speech tagging: rich part-of-speech information assists the segmentation decisions, and the segmentation result is in turn checked and adjusted during tagging, which improves the segmentation accuracy.
3. Segmentation based on understanding: the computer simulates a human's understanding of the sentence in order to identify words. The basic idea is to perform syntactic and semantic analysis during segmentation and to use syntactic and semantic information to resolve ambiguity. Such a system generally includes three parts: a segmentation subsystem, a syntactic-semantic subsystem and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences in order to resolve segmentation ambiguity; that is, it simulates the human process of understanding a sentence. This segmentation method requires a large amount of linguistic knowledge and information.
4. Segmentation based on statistics: in Chinese text, the frequency or probability with which characters co-occur adjacently reflects well how likely they are to form a word. The frequency of each combination of adjacently co-occurring characters in a corpus can therefore be counted and their mutual information calculated, together with the adjacent co-occurrence probability of two Chinese characters X and Y. The mutual information reflects how tightly the characters are bound together. When the tightness is higher than a certain threshold, the character group can be considered likely to constitute a word.
For example, the above interest point names can be segmented as follows:
The lower the first word frequency of a segmented word, the more information it generally carries. The X segmented words with the lowest first word frequency within the same interest point name can therefore be taken as the keywords of that interest point name, where X is a positive integer.
For example, the following keywords can be extracted from the above interest point names:
Interest point name | Keyword
500 tops of the world enterprise | The world
China Ping'an Insurance company | Safety
Chinese safety Yulin branch company | Safety
Yulin branch company, China Ping'an Insurance Co., Ltd. Branch | Safety
Here, words such as "enterprise", "company" and "branch company" have a high first word frequency and carry little information: they only indicate that the entity is an enterprise or company and do not point to anything specific, so they are not suitable as keywords. Words such as "safety", which is a commonly used short name of an enterprise, have a lower first word frequency and carry more information, so they are suitable as keywords.
It should be noted that address data such as the provinces, cities, counties (districts), townships and roads of the whole country can be obtained in advance to create an address database.
When a segmented word matches the preset address data, for example "China" or "Yulin", it is an invalid keyword and can be removed.
Sub-step S222, identifying the erroneous second target interest point name among the interest point names according to the keywords.
In a specific implementation, the second word frequency of each keyword in the interest point name set can be calculated, and the interest point names to which the Z keywords with the lowest second word frequency belong are determined to be erroneous second target interest point names, where Z is a positive integer.
For example, among the keywords of the above interest point names, the second word frequency of "the world" is 1 and that of "safety" is 3. Since the second word frequency of "the world" is lower, the interest point name to which it belongs, "500 tops of the world enterprise", can be confirmed as an erroneous second target interest point name.
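The selection in sub-step S222 can be sketched in the same way, this time keeping the names whose keyword is among the Z least frequent in the set. This is a minimal illustration assuming one keyword per name; the function name is hypothetical.

```python
from collections import Counter


def erroneous_names(name_to_keyword, z=1):
    """Return the names whose keyword is among the z least frequent keywords in the set."""
    second_freq = Counter(name_to_keyword.values())
    # sorted() is stable, so keywords with equal counts keep their first-seen order.
    rarest = sorted(second_freq, key=second_freq.get)[:z]
    return [name for name, keyword in name_to_keyword.items() if keyword in rarest]


# Names and keywords as listed in the example table.
name_to_keyword = {
    "500 tops of the world enterprise": "The world",
    "China Ping'an Insurance company": "Safety",
    "Chinese safety Yulin branch company": "Safety",
    "Yulin branch company, China Ping'an Insurance Co., Ltd. Branch": "Safety",
}
wrong = erroneous_names(name_to_keyword)
```

If every name in a set shares the same keyword, this sketch would flag all of them. A practical implementation would presumably guard against that case, for example by requiring at least two distinct keywords, but the source does not say how.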
Sub-step S23, determining that the interest point data to which the second target interest point name belongs is erroneous second target interest point data.
When the name of a POI is erroneous, the POI can be confirmed to be an erroneous POI.
Step 203, counting the second quantity of the second target interest point data belonging to the same website;
In practical applications, the URL corresponding to each piece of second target interest point data can be
looked up, and when the URLs of the second target interest point data belong to the domain name of the same
website, the second quantity of the second target interest point data is counted.
For example, in the above interest point data, the URLs of "500 tops of the world enterprise", "China Ping'an
Insurance company" and "Chinese safety Yulin branch company" belong to the domain name "* * * A" of the same
website, i.e., these interest point names belong to the same website, and the second quantity of the second
target interest point data of this website is 1.
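The per-website counting can be sketched as below. The record layout `(url, is_erroneous)` and the `site-a.example` URLs are hypothetical, and using the URL host name as "the domain name of the website" is a simplifying assumption (hosts such as `www.` and `m.` of one site would be counted separately).

```python
from collections import Counter
from typing import Iterable, Tuple
from urllib.parse import urlparse


def count_by_site(records: Iterable[Tuple[str, bool]]) -> Tuple[Counter, Counter]:
    """Count total and erroneous interest point records per host name.

    Each record is a (url, is_erroneous) pair.
    """
    total, erroneous = Counter(), Counter()
    for url, is_erroneous in records:
        site = urlparse(url).hostname or ""
        total[site] += 1
        if is_erroneous:
            erroneous[site] += 1
    return total, erroneous


# Hypothetical URLs standing in for the masked domain name in the example.
RECORDS = [
    ("http://site-a.example/poi/1", True),   # "500 tops of the world enterprise"
    ("http://site-a.example/poi/2", False),  # "China Ping'an Insurance company"
    ("http://site-a.example/poi/3", False),  # "Chinese safety Yulin branch company"
]

total_by_site, second_quantity_by_site = count_by_site(RECORDS)
```

Here `second_quantity_by_site["site-a.example"]` is 1 out of a total of 3, matching the example.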
Step 204, determining the confidence level of interest point data in the website according to the second quantity.
In a specific implementation, an error rate can be calculated from the second quantity, i.e., the ratio of the
second quantity to the total quantity; for the above website whose domain name is "* * * A", the error rate
is 33.33%.
The confidence level of interest point data in the website is determined according to the error rate; in this
case, the confidence level characterizes unreliability.
In one example, the error rate can be assigned directly to the confidence level;
In another example, weights can be configured for the error rates of different time periods, with the weights
decaying over time, and the weighted error rates are combined, for example by summation, to calculate the
confidence level.
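Both examples can be sketched in a few lines. The text does not specify the decay function, so the exponential weights `decay ** age` and the normalisation of the weighted sum are assumptions made for illustration, as are the function names.

```python
from typing import Sequence, Tuple


def error_rate(second_quantity: int, total_quantity: int) -> float:
    """Ratio of erroneous interest point data to all interest point data of a website."""
    if total_quantity <= 0:
        raise ValueError("total_quantity must be positive")
    return second_quantity / total_quantity


def decayed_error_rate(periods: Sequence[Tuple[int, float]], decay: float = 0.5) -> float:
    """Combine per-period error rates with weights that decay with age.

    periods holds (age, rate) pairs, age 0 being the most recent period.
    Assumption: weights are decay**age, normalised so that they sum to 1.
    """
    if not periods:
        raise ValueError("periods must not be empty")
    weights = [decay ** age for age, _ in periods]
    weighted_sum = sum(w * rate for w, (_, rate) in zip(weights, periods))
    return weighted_sum / sum(weights)
```

For the website in the example, `error_rate(1, 3)` gives roughly 0.3333. With `decay=0.5`, a period that is one step older counts half as much as the current one.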
Of course, the above calculations of the confidence level are only examples. When implementing the embodiments
of the present invention, other calculations of the confidence level can be set according to the actual
situation, and those skilled in the art can also adopt other calculations according to actual needs; the
embodiments of the present invention are not limited in this respect.
When the confidence level is lower than the preset second threshold, the website is shown to be an untrusted
POI source, and extracting interest point data from the web pages of the website is forbidden.
The embodiment of the present invention identifies erroneous second target interest point data from the
interest point data extracted from web pages, counts the second quantity of the second target interest point
data belonging to the same website, and determines the confidence level of interest point data in the website,
so that the erroneous POI data can be rejected in subsequent operations, which reduces the error rate of those
operations and the waste of resources.
Furthermore, crawling POI data from untrusted POI data sources is forbidden according to the confidence level,
so the POI data that is crawled is more likely to be correct, which reduces the waste of computer system
resources and bandwidth and improves the efficiency of POI data crawling.
Referring to Fig. 3, a flow chart of the steps of embodiment 3 of a method for determining the confidence level
of interest point data in a website according to an embodiment of the present invention is shown, which may
specifically include the following steps:
Step 301, extracting interest point data from a web page;
Step 302, identifying correct first target interest point data and erroneous second target interest point data
from the interest point data;
Step 303, counting the first quantity of the first target interest point data and the second quantity of the
second target interest point data belonging to the same website;
Step 304, determining the confidence level of interest point data in the website according to the first
quantity and the second quantity.
In an optional embodiment of the present invention, the method may further include the following steps:
Step 305, when the confidence level is higher than a preset first threshold, allowing interest point data to be
extracted from the web pages of the website;
Step 306, when the confidence level is lower than a preset second threshold, forbidding interest point data
from being extracted from the web pages of the website.
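Steps 305 and 306 amount to a two-threshold gate, sketched below. The text does not say what happens when the confidence level lies between the two thresholds, so the `UNDECIDED` outcome is an assumption, as are the names `Decision` and `extraction_decision`.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    FORBID = "forbid"
    UNDECIDED = "undecided"  # assumption: behaviour between the thresholds is unspecified


def extraction_decision(
    confidence: float, first_threshold: float, second_threshold: float
) -> Decision:
    """Decide whether interest point data may be extracted from a website."""
    if confidence > first_threshold:
        return Decision.ALLOW
    if confidence < second_threshold:
        return Decision.FORBID
    return Decision.UNDECIDED
```

A crawler could consult this decision per domain before scheduling further requests, for example skipping every URL whose site maps to `Decision.FORBID`.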
In an optional embodiment of the present invention, step 301 may include the following sub-steps:
Sub-step S31, looking up the template configured for the web page;
Sub-step S32, extracting interest point data from the web page at the position indicated by the template.
In an optional embodiment of the present invention, the interest point data includes an interest point name;
step 302 may include the following sub-steps:
Sub-step S41, setting the interest point names that identify the same target as an interest point name set;
Sub-step S42, identifying a correct first target interest point name and an erroneous second target interest
point name from the interest point name set;
Sub-step S43, determining that the interest point data to which the first target interest point name belongs
is the correct first target interest point data;
Sub-step S44, determining that the interest point data to which the second target interest point name belongs
is the erroneous second target interest point data.
In an optional embodiment of the present invention, the interest point data includes an interest point address;
sub-step S41 may further include the following sub-steps:
Sub-step S411, judging whether the interest point addresses are the same or similar; if so, executing
sub-step S412;
Sub-step S412, setting the interest point names corresponding to those interest point addresses as an interest
point name set.
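A minimal sketch of sub-steps S411 and S412 follows. The text does not define "similar", so the use of `difflib.SequenceMatcher` with a 0.8 ratio threshold is an assumption, as are the function name and the sample addresses; the grouping is greedy, comparing each address only with the first address of each existing group.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def group_names_by_address(
    pois: List[Tuple[str, str]], min_ratio: float = 0.8
) -> List[List[str]]:
    """Group interest point names whose addresses are the same or similar.

    pois holds (name, address) pairs.
    """
    groups: List[Tuple[str, List[str]]] = []
    for name, address in pois:
        for representative, names in groups:
            # Assumption: "similar" means a SequenceMatcher ratio of at least min_ratio.
            if SequenceMatcher(None, representative, address).ratio() >= min_ratio:
                names.append(name)
                break
        else:
            # No existing group matched, so this address starts a new group.
            groups.append((address, [name]))
    return [names for _, names in groups]


# Hypothetical addresses, for illustration only.
name_sets = group_names_by_address([
    ("China Ping'an Insurance company", "1 Renmin Road, Yulin"),
    ("500 tops of the world enterprise", "1 Renmin Road, Yulin"),
    ("Some other shop", "99 Harbour Street, Xiamen"),
])
```

Each inner list is one interest point name set that can then be passed to keyword selection.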
In an optional embodiment of the present invention, sub-step S42 may further include the following sub-steps:
Sub-step S421, choosing keywords from the interest point names in the interest point name set;
Sub-step S422, identifying the correct first target interest point name and the erroneous second target
interest point name from the interest point names according to the keywords.
In an optional embodiment of the present invention, sub-step S421 may further include the following sub-steps:
Sub-step S4211, performing word segmentation on the interest point names in the interest point name set to
obtain one or more participles;
Sub-step S4212, looking up the first word frequency of each participle in a preset interest point set;
Sub-step S4213, taking the X participles with the lowest first word frequency in the same interest point name
as the keywords of that interest point name, where X is a positive integer.
In an optional embodiment of the present invention, sub-step S421 may further include the following sub-step:
Sub-step S4214, removing a participle when the participle matches preset address data.
In an optional embodiment of the present invention, sub-step S422 may further include the following sub-steps:
Sub-step S4221, calculating the second word frequency of each keyword in the interest point set;
Sub-step S4222, taking the interest point names to which the Y keywords with the highest second word frequency
belong as correct first target interest point names;
Sub-step S4223, taking the interest point names to which the Z keywords with the lowest second word frequency
belong as erroneous second target interest point names, where Y and Z are positive integers.
In an optional embodiment of the present invention, the interest point data includes a URL; step 303 may
include the following sub-steps:
Sub-step S51, looking up the URL corresponding to the first target interest point data and the URL
corresponding to the second target interest point data;
Sub-step S52, when the URLs corresponding to the first target interest point data belong to the domain name of
the same website, counting the first quantity of the first target interest point data;
Sub-step S53, when the URLs corresponding to the second target interest point data belong to the domain name of
the same website, counting the second quantity of the second target interest point data.
In an optional embodiment of the present invention, step 304 may include the following sub-steps:
Sub-step S61, calculating an accuracy according to the first quantity;
Sub-step S62, calculating an error rate according to the second quantity;
Sub-step S63, determining the confidence level of interest point data in the website according to the accuracy
and the error rate.
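Sub-steps S61 to S63 can be illustrated as below. The text does not state how the accuracy and the error rate are combined, so the difference `accuracy - error_rate`, clipped at 0, is purely a hypothetical choice for this sketch; the function name is likewise an assumption.

```python
def confidence_from_counts(
    first_quantity: int, second_quantity: int, total_quantity: int
) -> float:
    """Combine accuracy and error rate into one confidence value.

    Assumption: confidence = max(0, accuracy - error_rate). The source only
    says that the confidence is determined from both values.
    """
    if total_quantity <= 0:
        raise ValueError("total_quantity must be positive")
    accuracy = first_quantity / total_quantity     # sub-step S61
    error_rate = second_quantity / total_quantity  # sub-step S62
    return max(0.0, accuracy - error_rate)         # sub-step S63 (hypothetical rule)
```

For a website with 4 extracted records, 2 of them confirmed correct and 1 confirmed erroneous, this yields 0.5 - 0.25 = 0.25. The total may exceed the sum of the two quantities, because names that are among neither the Y highest nor the Z lowest keywords are left unclassified.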
Since this embodiment is substantially similar to method embodiments 1 and 2, it is described relatively
briefly; for related details, refer to the description of method embodiments 1 and 2, which is not repeated here.
For simplicity of description, the method embodiments are expressed as a series of action combinations, but
those skilled in the art should be aware that embodiments of the present invention are not limited by the
described order of actions, because according to embodiments of the present invention some steps may be
performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the
embodiments described in the specification are all preferred embodiments, and the actions involved are not
necessarily required by embodiments of the present invention.
Referring to Fig. 4, a structural block diagram of embodiment 1 of a device for determining the confidence
level of interest point data in a website according to an embodiment of the present invention is shown, which
may specifically include the following modules:
Interest point data extraction module 401, adapted to extract interest point data from a web page;
Correct interest point data identification module 402, adapted to identify correct first target interest point
data from the interest point data;
Correct quantity statistics module 403, adapted to count the first quantity of the first target interest point
data belonging to the same website;
Credible confidence determination module 404, adapted to determine the confidence level of interest point data
in the website according to the first quantity.
In an optional embodiment of the present invention, the device may further include the following module:
Allow-extraction module, adapted to allow interest point data to be extracted from the web pages of the website
when the confidence level is higher than a preset first threshold.
In an optional embodiment of the present invention, the interest point data extraction module 401 may be
further adapted to:
Look up the template configured for the web page;
Extract interest point data from the web page at the position indicated by the template.
In an optional embodiment of the present invention, the interest point data includes an interest point name;
The correct interest point data identification module 402 may be further adapted to:
Set the interest point names that identify the same target as an interest point name set;
Identify a correct first target interest point name from the interest point name set;
Determine that the interest point data to which the first target interest point name belongs is the correct
first target interest point data.
In an optional embodiment of the present invention, the interest point data includes an interest point address;
The correct interest point data identification module 402 may be further adapted to:
Judge whether the interest point addresses are the same or similar; if so, set the interest point names
corresponding to those interest point addresses as an interest point name set.
In an optional embodiment of the present invention, the correct interest point data identification module 402
may be further adapted to:
Choose keywords from the interest point names in the interest point name set;
Identify the correct first target interest point name from the interest point names according to the keywords.
In an optional embodiment of the present invention, the correct interest point data identification module 402
may be further adapted to:
Perform word segmentation on the interest point names in the interest point name set to obtain one or more
participles;
Look up the first word frequency of each participle in a preset interest point set;
Take the X participles with the lowest first word frequency in the same interest point name as the keywords of
that interest point name, where X is a positive integer.
In an optional embodiment of the present invention, the correct interest point data identification module 402
may be further adapted to:
Remove a participle when the participle matches preset address data.
In an optional embodiment of the present invention, the correct interest point data identification module 402
may be further adapted to:
Calculate the second word frequency of each keyword in the interest point set;
Take the interest point names to which the Y keywords with the highest second word frequency belong as correct
first target interest point names, where Y is a positive integer.
In an optional embodiment of the present invention, the interest point data includes a URL;
The correct quantity statistics module 403 may be further adapted to:
Look up the URL corresponding to the first target interest point data;
When the URLs corresponding to the first target interest point data belong to the domain name of the same
website, count the first quantity of the first target interest point data.
In an optional embodiment of the present invention, the credible confidence determination module 404 may be
further adapted to:
Calculate an accuracy according to the first quantity;
Determine the confidence level of interest point data in the website according to the accuracy.
Referring to Fig. 5, a structural block diagram of embodiment 2 of a device for determining the confidence
level of interest point data in a website according to an embodiment of the present invention is shown, which
may specifically include the following modules:
Interest point data extraction module 501, adapted to extract interest point data from a web page;
Erroneous interest point data identification module 502, adapted to identify erroneous second target interest
point data from the interest point data;
Error quantity statistics module 503, adapted to count the second quantity of the second target interest point
data belonging to the same website;
Untrusted confidence determination module 504, adapted to determine the confidence level of interest point data
in the website according to the second quantity.
In an optional embodiment of the present invention, the device may further include the following module:
Forbid-extraction module, adapted to forbid interest point data from being extracted from the web pages of the
website when the confidence level is lower than a preset second threshold.
In an optional embodiment of the present invention, the interest point data extraction module 501 may be
further adapted to:
Look up the template configured for the web page;
Extract interest point data from the web page at the position indicated by the template.
In an optional embodiment of the present invention, the interest point data includes an interest point name;
The erroneous interest point data identification module 502 may be further adapted to:
Set the interest point names that identify the same target as an interest point name set;
Identify an erroneous second target interest point name from the interest point name set;
Determine that the interest point data to which the second target interest point name belongs is the erroneous
second target interest point data.
In an optional embodiment of the present invention, the interest point data includes an interest point address;
The erroneous interest point data identification module 502 may be further adapted to:
Judge whether the interest point addresses are the same or similar; if so, set the interest point names
corresponding to those interest point addresses as an interest point name set.
In an optional embodiment of the present invention, the erroneous interest point data identification module 502
may be further adapted to:
Choose keywords from the interest point names in the interest point name set;
Identify the erroneous second target interest point name from the interest point names according to the keywords.
In an optional embodiment of the present invention, the erroneous interest point data identification module 502
may be further adapted to:
Perform word segmentation on the interest point names in the interest point name set to obtain one or more
participles;
Look up the first word frequency of each participle in a preset interest point set;
Take the X participles with the lowest first word frequency in the same interest point name as the keywords of
that interest point name, where X is a positive integer.
In an optional embodiment of the present invention, the erroneous interest point data identification module 502
may be further adapted to:
Remove a participle when the participle matches preset address data.
In an optional embodiment of the present invention, the erroneous interest point data identification module 502
may be further adapted to:
Calculate the second word frequency of each keyword in the interest point set;
Take the interest point names to which the Z keywords with the lowest second word frequency belong as erroneous
second target interest point names, where Z is a positive integer.
In an optional embodiment of the present invention, the interest point data includes a URL;
The error quantity statistics module 503 may be further adapted to:
Look up the URL corresponding to the second target interest point data;
When the URLs corresponding to the second target interest point data belong to the domain name of the same
website, count the second quantity of the second target interest point data.
In an optional embodiment of the present invention, the untrusted confidence determination module 504 may be
further adapted to:
Calculate an error rate according to the second quantity;
Determine the confidence level of interest point data in the website according to the error rate.
Referring to Fig. 6, a structural block diagram of embodiment 3 of a device for determining the confidence
level of interest point data in a website according to an embodiment of the present invention is shown, which
may specifically include the following modules:
Interest point data extraction module 601, adapted to extract interest point data from a web page;
Interest point data identification module 602, adapted to identify correct first target interest point data and
erroneous second target interest point data from the interest point data;
Quantity statistics module 603, adapted to count the first quantity of the first target interest point data and
the second quantity of the second target interest point data belonging to the same website;
Confidence determination module 604, adapted to determine the confidence level of interest point data in the
website according to the first quantity and the second quantity.
In an optional embodiment of the present invention, the device may further include the following modules:
Allow-extraction module, adapted to allow interest point data to be extracted from the web pages of the website
when the confidence level is higher than a preset first threshold;
Forbid-extraction module, adapted to forbid interest point data from being extracted from the web pages of the
website when the confidence level is lower than a preset second threshold.
In an optional embodiment of the present invention, the interest point data extraction module 601 may be
further adapted to:
Look up the template configured for the web page;
Extract interest point data from the web page at the position indicated by the template.
In an optional embodiment of the present invention, the interest point data includes an interest point name;
The interest point data identification module 602 may be further adapted to:
Set the interest point names that identify the same target as an interest point name set;
Identify a correct first target interest point name and an erroneous second target interest point name from the
interest point name set;
Determine that the interest point data to which the first target interest point name belongs is the correct
first target interest point data;
Determine that the interest point data to which the second target interest point name belongs is the erroneous
second target interest point data.
In an optional embodiment of the present invention, the interest point data includes an interest point address;
The interest point data identification module 602 may be further adapted to:
Judge whether the interest point addresses are the same or similar; if so, set the interest point names
corresponding to those interest point addresses as an interest point name set.
In an optional embodiment of the present invention, the interest point data identification module 602 may be
further adapted to:
Choose keywords from the interest point names in the interest point name set;
Identify the correct first target interest point name and the erroneous second target interest point name from
the interest point names according to the keywords.
In an optional embodiment of the present invention, the interest point data identification module 602 may be
further adapted to:
Perform word segmentation on the interest point names in the interest point name set to obtain one or more
participles;
Look up the first word frequency of each participle in a preset interest point set;
Take the X participles with the lowest first word frequency in the same interest point name as the keywords of
that interest point name, where X is a positive integer.
In an optional embodiment of the present invention, the interest point data identification module 602 may be
further adapted to:
Remove a participle when the participle matches preset address data.
In an optional embodiment of the present invention, the interest point data identification module 602 may be
further adapted to:
Calculate the second word frequency of each keyword in the interest point set;
Take the interest point names to which the Y keywords with the highest second word frequency belong as correct
first target interest point names;
Take the interest point names to which the Z keywords with the lowest second word frequency belong as erroneous
second target interest point names, where Y and Z are positive integers.
In an optional embodiment of the present invention, the interest point data includes a URL;
The quantity statistics module 603 may be further adapted to:
Look up the URL corresponding to the first target interest point data and the URL corresponding to the second
target interest point data;
When the URLs corresponding to the first target interest point data belong to the domain name of the same
website, count the first quantity of the first target interest point data;
When the URLs corresponding to the second target interest point data belong to the domain name of the same
website, count the second quantity of the second target interest point data.
In an optional embodiment of the present invention, the confidence determination module 604 may be further
adapted to:
Calculate an accuracy according to the first quantity;
Calculate an error rate according to the second quantity;
Determine the confidence level of interest point data in the website according to the accuracy and the error rate.
Since the device embodiments are basically similar to the method embodiments, they are described relatively
briefly; for related details, refer to the description of the method embodiments.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual
system or other device. Various general-purpose systems may also be used with the teachings herein. From the
description above, the structure required to construct such a system is apparent. In addition, the present
invention is not directed to any particular programming language. It should be understood that the content of
the invention described herein may be implemented in a variety of programming languages, and that the above
description of a specific language is given to disclose the best mode of the invention.
In the description provided here, numerous specific details are set forth. However, it is to be understood
that embodiments of the invention may be practiced without these specific details. In some instances,
well-known methods, structures and techniques have not been shown in detail so as not to obscure the
understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or
more of the various inventive aspects, in the above description of exemplary embodiments of the invention the
features of the invention are sometimes grouped together in a single embodiment, figure or description
thereof. However, the disclosed method should not be interpreted as reflecting an intention that the claimed
invention requires more features than are expressly recited in each claim. Rather, as the following claims
reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore,
the claims following the detailed description are hereby expressly incorporated into the detailed description,
with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively
changed and arranged in one or more devices different from that embodiment. The modules, units or components
of an embodiment can be combined into one module, unit or component, and they can furthermore be divided into
multiple sub-modules, sub-units or sub-components. Except where at least some of such features and/or
processes or units are mutually exclusive, all features disclosed in this specification (including the
accompanying claims, abstract and drawings), and all processes or units of any method or device so disclosed,
may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this
specification (including the accompanying claims, abstract and drawings) may be replaced by an alternative
feature serving the same, an equivalent or a similar purpose.
In addition, those skilled in the art will appreciate that, although some embodiments described herein include
certain features that are included in other embodiments rather than other features, combinations of features
of different embodiments are meant to be within the scope of the invention and to form different embodiments.
For example, in the following claims, any one of the claimed embodiments can be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running
on one or more processors, or in a combination thereof. Those skilled in the art will understand that a
microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the
functions of some or all of the components of the device for determining the confidence level of interest
point data in a website according to an embodiment of the present invention. The invention may also be
implemented as a device or apparatus program (for example, a computer program and a computer program product)
for performing part or all of the method described herein. Such a program implementing the invention may be
stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be
downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those
skilled in the art can design alternative embodiments without departing from the scope of the appended claims.
In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The
word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or
"an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be
implemented by means of hardware comprising several distinct elements and by means of a suitably programmed
computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the
same item of hardware. The use of the words first, second and third does not indicate any order; these words
can be interpreted as names.