CN105160032B - Method and device for determining the confidence of point-of-interest data in a website

Info

Publication number: CN105160032B
Application number: CN201510642636.0A
Authority: CN
Inventors: 王智广
Original assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Current assignee: Beijing Qihoo Technology Co Ltd
Priority date: 2015-09-30
Filing date: 2015-09-30
Publication date: 2019-05-31
Anticipated expiration: 2035-09-30
Also published as: CN105160032A

Abstract

Embodiments of the present invention provide a method and device for determining the confidence of POI data in a website. The method includes: extracting POI data from web pages; identifying erroneous second target POI data among the POI data; counting a second quantity of the second target POI data belonging to the same website; and determining the confidence of the POI data in the website according to the second quantity. According to the confidence, embodiments of the present invention prohibit crawling POI data from untrusted POI data sources, so that the crawled POI data is more accurate, fewer computer system and bandwidth resources are wasted, and POI data crawling becomes more efficient.

Description

Method and device for determining the confidence of point-of-interest data in a website

Technical field

The present invention relates to the technical field of computer processing, and in particular to a method for determining the confidence of point-of-interest data in a website and a device for determining the confidence of point-of-interest data in a website.

Background technique

A point of interest (POI), also called an "information point", carries various kinds of information, such as a name, a category, a longitude and a latitude.

In a geographic information system (GIS), a POI can be a house, a shop, a mailbox, a bus stop, and so on.

Traditional geographic information collection requires surveyors to go on site, measure the longitude and latitude of a point of interest with precise surveying instruments, and then annotate it.

Because collecting POI data is so time-consuming and laborious, the number of POIs in a geographic information system represents, to some extent, the value of the whole system.

To enrich the POI data of geographic information systems, POI data is currently mined from web pages, mostly by configuring a suitable template according to the structure of the web page and extracting data through the template.

However, users do not necessarily publish information according to the rules of the web page, so websites containing POIs are flooded with dirty data, that is, erroneous POI data.

For example, some websites designate one region of a web page for publishing a company name, but some users may publish data such as "top 500 enterprise of the world", which is not a real POI name.

If such erroneous POI data is later used for operations such as navigation, the error rate of those operations is high and resources are wasted.

Moreover, the computer keeps crawling this erroneous POI data, which wastes computer system resources and bandwidth and makes POI data crawling inefficient.

Summary of the invention

In view of the above problems, the present invention is proposed to provide a method for determining the confidence of POI data in a website, and a corresponding device, that overcome the above problems or at least partially solve them.

According to one aspect of the present invention, a method for determining the confidence of POI data in a website is provided, including:

extracting POI data from web pages;

identifying erroneous second target POI data among the POI data;

counting a second quantity of the second target POI data belonging to the same website;

determining the confidence of the POI data in the website according to the second quantity.

Optionally, the method further includes:

when the confidence is lower than a preset second threshold, prohibiting extraction of POI data from the web pages of the website.

Optionally, the step of extracting POI data from web pages includes:

looking up the template configured for the web page;

extracting POI data from the web page at the positions indicated by the template.

Optionally, the POI data includes a POI name;

the step of identifying erroneous second target POI data among the POI data includes:

grouping the POI names that identify the same object into a POI name set;

identifying an erroneous second target POI name in the POI name set;

determining that the POI data to which the second target POI name belongs is erroneous second target POI data.

Optionally, the POI data includes a POI address;

the step of grouping the POI names that identify the same object into a POI name set includes:

judging whether the POI addresses are the same or similar; if so, grouping the POI names associated with those POI addresses into a POI name set.

Optionally, the step of identifying an erroneous second target POI name in the POI name set includes:

selecting keywords from the POI names in the POI name set;

identifying the erroneous second target POI name among the POI names according to the keywords.

Optionally, the step of selecting keywords from the POI names in the POI name set includes:

performing word segmentation on the POI names in the POI name set to obtain one or more segmented words;

looking up a first word frequency of each segmented word in a preset POI set;

taking the X segmented words with the lowest first word frequency in the same POI name as the keywords of that POI name, where X is a positive integer.

Optionally, the step of selecting keywords from the POI names in the POI name set further includes:

when a segmented word matches preset address data, removing that segmented word.

Optionally, the step of identifying the erroneous second target POI name among the POI names according to the keywords includes:

calculating a second word frequency of each keyword in the POI set;

taking the POI names to which the Z keywords with the lowest second word frequency belong as erroneous second target POI names, where Z is a positive integer.

Optionally, the POI data includes a URL;

the step of counting a second quantity of the second target POI data belonging to the same website includes:

looking up the URL corresponding to the second target POI data;

when the URLs corresponding to the second target POI data belong to the domain name of the same website, counting the second quantity of the second target POI data.

Optionally, the step of determining the confidence of the POI data in the website according to the second quantity includes:

calculating an error rate according to the second quantity;

determining the confidence of the POI data in the website according to the error rate.

According to another aspect of the present invention, a device for determining the confidence of POI data in a website is provided, including:

a POI data extraction module, adapted to extract POI data from web pages;

an erroneous POI data identification module, adapted to identify erroneous second target POI data among the POI data;

an error quantity counting module, adapted to count a second quantity of the second target POI data belonging to the same website;

an untrusted-confidence determination module, adapted to determine the confidence of the POI data in the website according to the second quantity.

Optionally, the device further includes:

an extraction prohibition module, adapted to prohibit extraction of POI data from the web pages of the website when the confidence is lower than a preset second threshold.

Optionally, the POI data extraction module is further adapted to:

look up the template configured for the web page;

Optionally, the POI data includes a POI name;

the erroneous POI data identification module is further adapted to:

Optionally, the POI data includes a POI address;

the erroneous POI data identification module is further adapted to:

Optionally, the erroneous POI data identification module is further adapted to:

select keywords from the POI names in the POI name set;

look up a first word frequency of each segmented word in a preset POI set;

calculate a second word frequency of each keyword in the POI set;

Optionally, the POI data includes a URL;

the error quantity counting module is further adapted to:

look up the URL corresponding to the second target POI data;

Optionally, the untrusted-confidence determination module is further adapted to:

calculate an error rate according to the second quantity;

Embodiments of the present invention identify erroneous second target POI data among the POI data extracted from web pages, count the second quantity of the second target POI data belonging to the same website, and determine the confidence of the POI data in the website, so that the erroneous POI data can be rejected in subsequent operations, which lowers the error rate of those operations and reduces wasted resources.

Furthermore, crawling POI data from untrusted POI data sources is prohibited according to the confidence, so the crawled POI data is more accurate, fewer computer system and bandwidth resources are wasted, and POI data crawling becomes more efficient.

The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the present invention are more apparent, specific embodiments of the present invention are set out below.

Brief description of the drawings

Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference numbers refer to the same parts. In the drawings:

Fig. 1 shows a flow chart of the steps of Embodiment 1 of a method for determining the confidence of POI data in a website according to an embodiment of the present invention;

Fig. 2 shows a flow chart of the steps of Embodiment 2 of a method for determining the confidence of POI data in a website according to an embodiment of the present invention;

Fig. 3 shows a flow chart of the steps of Embodiment 3 of a method for determining the confidence of POI data in a website according to an embodiment of the present invention;

Fig. 4 shows a structural block diagram of Embodiment 1 of a device for determining the confidence of POI data in a website according to an embodiment of the present invention;

Fig. 5 shows a structural block diagram of Embodiment 2 of a device for determining the confidence of POI data in a website according to an embodiment of the present invention; and

Fig. 6 shows a structural block diagram of Embodiment 3 of a device for determining the confidence of POI data in a website according to an embodiment of the present invention.

Specific embodiment

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the present disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the present disclosure will be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.

Referring to Fig. 1, a flow chart of the steps of Embodiment 1 of a method for determining the confidence of POI data in a website according to an embodiment of the present invention is shown. The method may specifically include the following steps:

Step 101, extracting POI data from web pages;

In embodiments of the present invention, a crawler can crawl and save web pages of the Internet in advance by following the links between them; the crawled web pages are stored in a web page database and form a large pool of search resources.

For web pages that contain many POI data items distributed in a regular pattern, such as pages on websites where users review restaurants or travel destinations, or pages on map websites, the template configured for the web page can be looked up and POI data extracted from the page at the positions indicated by the template. A large amount of POI data is thereby obtained, including associated POI names, POI addresses, URLs (Uniform Resource Locators), and so on.

For example, part of the web page structure of a certain website is as follows:

Wherein, "* * *" is the domain name.

In the template of this website, the POI name can be extracted from the first line and the POI address from the last line.

Through the templates, the following POI data is extracted from the web pages of different websites:

Wherein, "* * * A" and "* * * B" are different domain names.
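As a rough illustration of the template lookup and extraction in step 101, the sketch below keys a set of regular expressions by domain name and applies them to a page. The markup, the field patterns, the `TEMPLATES` mapping and the `a.example` domain are all hypothetical; the specification does not state how templates are represented.

```python
import re
from urllib.parse import urlparse

# Hypothetical templates: one regular expression per field, keyed by domain.
TEMPLATES = {
    "a.example": {
        "name": re.compile(r'<h1 class="title">(.*?)</h1>', re.S),
        "address": re.compile(r'<p class="addr">(.*?)</p>', re.S),
    },
}

def extract_poi(url, html):
    """Return a POI record for the page, or None if no template applies."""
    template = TEMPLATES.get(urlparse(url).hostname)
    if template is None:
        return None  # no template configured for this website
    record = {"url": url}
    for field, pattern in template.items():
        match = pattern.search(html)
        if match is None:
            return None  # page does not follow the template
        record[field] = match.group(1).strip()
    return record
```

Keeping the URL in each record matters for the later steps, which attribute every POI data item to a website through its URL.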

Step 102, identifying correct first target POI data among the POI data;

In embodiments of the present invention, correct first target POI data refers to data that meets the POI specification, including a correct name, address, and so on.

In an optional embodiment of the present invention, step 102 may include the following sub-steps:

Sub-step S11, grouping the POI names that identify the same object into a POI name set;

A POI data item generally identifies one object, such as a house, a shop, a mailbox, a bus stop, and so on.

Since the address information of an object is generally quite accurate, in embodiments of the present invention the POI addresses can be normalized to judge whether they are the same or similar; if so, the POI names associated with those POI addresses are grouped into a POI name set.

For example, take "three building, the permanent general merchandise in Yulin road Yu Yangfushi tide today hotel next door east", "Yulin Yuyang District skin Shi Lujin Diurnal tide next door east three buildings the first sales departments of permanent general merchandise", "3 building, the permanent department store in Yulin south gate Yu Yang mouthful east" and "Yulin south Wholesale three buildings of the permanent general merchandise in doorway east". Although these 4 POI addresses are not exactly the same in form, normalization can determine that they all refer to the address "three building, the permanent department store in rate in Yuyang county east".

That is, the associated "500 tops of the world enterprise", "China Ping'an Insurance company", "Chinese safety Yulin branch company" and "Yulin branch company, China Ping'an Insurance Co., Ltd. Branch" form a POI name set.
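The grouping in sub-step S11 can be sketched as follows. The specification does not describe the normalization itself, so `normalize_address` here is a deliberately naive stand-in (lowercasing and dropping punctuation and spaces); a real system would need proper address parsing to treat merely similar addresses as equal. The record layout is also an assumption.

```python
import re
from collections import defaultdict

def normalize_address(address):
    """Toy normalization: lowercase and keep only letters and digits."""
    return re.sub(r"[^a-z0-9]", "", address.lower())

def group_names_by_address(records):
    """Map each normalized address to the POI names published under it."""
    groups = defaultdict(list)
    for record in records:
        groups[normalize_address(record["address"])].append(record["name"])
    return dict(groups)
```

Each value of the returned mapping plays the role of one POI name set.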

Sub-step S12, identifying a correct first target POI name in the POI name set;

In embodiments of the present invention, the correct POI name, i.e. the first target POI name, can be screened out by mining the keywords of the POI names.

In an optional embodiment of the present invention, sub-step S12 may further include the following sub-steps:

Sub-step S121, selecting keywords from the POI names in the POI name set;

In embodiments of the present invention, a keyword can be the word that carries the most information and reflects the characteristics of the POI name.

In a specific implementation, word segmentation can be performed on the POI names in the POI name set to obtain one or more segmented words;

The first word frequency of each segmented word in a preset POI set is then looked up. The POI set is the set of POI data in the crawled web pages, which can contain tens of millions of POI data items; the first word frequency is counted over the names of these tens of millions of POI data items.

In embodiments of the present invention, one or more of the following word segmentation methods can be used:

1. Segmentation based on string matching: the Chinese character string to be analyzed is matched, according to a certain strategy, against the entries of a preset machine dictionary; if a string is found in the dictionary, the match succeeds (a word is identified).

2. Segmentation based on feature scanning or mark cutting: some words with obvious features are first identified and cut out of the string to be analyzed; using these words as breakpoints, the original string is divided into smaller strings, which are then segmented mechanically, thereby reducing the matching error rate. Alternatively, segmentation is combined with part-of-speech tagging: rich part-of-speech information assists the segmentation decisions, and the segmentation result is in turn checked and adjusted during tagging, which improves segmentation accuracy.

3. Segmentation based on understanding: the computer simulates human understanding of a sentence in order to recognize words. The basic idea is to perform syntactic and semantic analysis during segmentation and to use syntactic and semantic information to resolve ambiguity. Such a system usually includes three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words, sentences, and so on to resolve segmentation ambiguity, i.e. it simulates the human process of understanding a sentence. This method requires a large amount of linguistic knowledge and information.

4. Segmentation based on statistics: in Chinese text, the frequency or probability with which adjacent characters co-occur reflects fairly well how likely they are to form a word. The frequencies of adjacent character combinations in a corpus can therefore be counted to compute their mutual information, that is, the adjacent co-occurrence probability of two Chinese characters X and Y. Mutual information reflects how tightly Chinese characters are bound together. When the tightness exceeds a certain threshold, the character group can be considered likely to constitute a word.
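To make the first approach (string matching) concrete, here is a minimal sketch of forward maximum matching, one common matching strategy; the specification does not say which strategy is used. The dictionary and the sample string are illustrative assumptions.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment text greedily, always taking the longest dictionary entry.

    Characters that start no dictionary entry are emitted as single tokens.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink it until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens

# Hypothetical dictionary and POI name.
words = {"中国", "平安", "保险", "公司"}
print(forward_max_match("中国平安保险公司", words))
```

Greedy matching is simple and fast but cannot resolve ambiguous cuts, which is the gap the other three approaches in the list aim to close.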

For example, the above POI names can be segmented as follows:

When the first word frequency of a segmented word is lowest, the information it carries is generally greatest, so the X segmented words with the lowest first word frequency in the same POI name can be taken as the keywords of that POI name, where X is a positive integer.

For example, the following keywords can be extracted from the above POI names:

POI name	Keyword
500 tops of the world enterprise	The world
China Ping'an Insurance company	Safety
Chinese safety Yulin branch company	Safety
Yulin branch company, China Ping'an Insurance Co., Ltd. Branch	Safety

Words such as "enterprise", "company" and "branch company" have a high first word frequency and carry little information; they only indicate that the entity is an enterprise or company and do not point to anything specific, so they are unsuitable as keywords. Words such as "safety" have a lower first word frequency and carry more information, being the commonly used short name of the enterprise, so they are suitable as keywords.

It should be noted that address data such as the provinces, cities, counties (districts), townships and roads of the whole country can be obtained in advance to create an address database.

When a segmented word matches the preset address data, for example "China" or "Yulin", it is an invalid keyword and can be removed.
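Keyword selection as described above (drop address words, then keep the X words with the lowest first word frequency) can be sketched like this. The tokens and the frequency values are invented for illustration, and treating words missing from the frequency table as frequency 0 is an assumption.

```python
def choose_keywords(tokens, first_word_freq, address_words, x=1):
    """Pick the x non-address tokens with the lowest first word frequency."""
    candidates = [t for t in tokens if t not in address_words]
    # sorted() is stable, so ties keep their order in the POI name.
    candidates = sorted(candidates, key=lambda t: first_word_freq.get(t, 0))
    return candidates[:x]

# Hypothetical first word frequencies counted over a large POI set.
freq = {"China": 900000, "safety": 1200, "insurance": 50000, "company": 800000}
addresses = {"China", "Yulin"}
print(choose_keywords(["China", "safety", "insurance", "company"], freq, addresses))
```

Choosing the rarest word is close in spirit to inverse document frequency weighting: a word that appears in few POI names says more about which POI is meant.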

Sub-step S122, identifying the correct first target POI name among the POI names according to the keywords.

In a specific implementation, the second word frequency of each keyword in the POI name set can be calculated, and the POI names to which the Y keywords with the highest second word frequency belong are determined to be correct first target POI names, where Y is a positive integer.

For example, for the keywords of the above POI names, the second word frequency of "world" is 1 and that of "safety" is 3. The second word frequency of "safety" is higher, so the names it belongs to, "China Ping'an Insurance company", "Chinese safety Yulin branch company" and "Yulin branch company, China Ping'an Insurance Co., Ltd. Branch", can be confirmed as correct first target POI names.
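A minimal sketch of this selection, using the keywords from the example and assuming one keyword per POI name (X = 1); the function name and the mapping layout are illustrative:

```python
from collections import Counter

def correct_names(name_to_keyword, y=1):
    """Return names whose keyword is among the y most frequent in the set."""
    second_freq = Counter(name_to_keyword.values())
    top = {keyword for keyword, _ in second_freq.most_common(y)}
    return [name for name, kw in name_to_keyword.items() if kw in top]

name_set = {
    "500 tops of the world enterprise": "world",
    "China Ping'an Insurance company": "safety",
    "Chinese safety Yulin branch company": "safety",
    "Yulin branch company, China Ping'an Insurance Co., Ltd. Branch": "safety",
}
print(correct_names(name_set))
```

The underlying idea is a majority vote: names that agree with most other names published for the same address are taken to be the real name of the object.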

Sub-step S13, determining that the POI data to which the first target POI name belongs is correct first target POI data.

When the name of a POI is correct, the POI can be confirmed to be a correct POI.

Step 103, counting a first quantity of the first target POI data belonging to the same website;

In practical applications, the URL corresponding to each first target POI data item can be looked up; when the URLs belong to the domain name of the same website, the first quantity of first target POI data is counted.

For example, in the above example of POI data, the URLs of "500 tops of the world enterprise", "China Ping'an Insurance company" and "Chinese safety Yulin branch company" belong to the domain name "* * * A" of the same website, i.e. these POI names belong to the same website, and the first quantity of first target POI data of this website is 2.
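Counting per website can be sketched with the standard library as below. Using the URL's host name as the website key is an assumption; a production system would probably also merge subdomains of one registered domain. The example URLs are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse

def count_by_site(records):
    """Count POI records per website, keyed by the host name of their URL."""
    return Counter(urlparse(record["url"]).hostname for record in records)

# Hypothetical correct (first target) POI records.
correct = [
    {"name": "China Ping'an Insurance company", "url": "http://a.example/poi/2"},
    {"name": "Chinese safety Yulin branch company", "url": "http://a.example/poi/3"},
    {"name": "Yulin branch company, China Ping'an Insurance Co., Ltd. Branch",
     "url": "http://b.example/item/9"},
]
print(count_by_site(correct))
```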

Step 104, determining the confidence of the POI data in the website according to the first quantity.

In a specific implementation, the accuracy, i.e. the ratio of the first quantity to the total quantity, can be calculated from the first quantity; for the above website with the domain name "* * * A", the accuracy is 66.67%.

The confidence of the POI data in the website is determined according to the accuracy; in this case, the confidence characterizes trustworthiness.

In one example, the accuracy can be assigned directly to the confidence;

In another example, weights that decay over time can be configured for the accuracies of different time periods, and the confidence is calculated from the weighted accuracies, for example by summation.
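Both variants can be sketched in a few lines. The geometric decay factor and the normalization by the sum of the weights are assumptions made here so that the result stays between 0 and 1; the text only says that the weights decay with time and that the weighted accuracies are combined, for example by summation.

```python
def accuracy(first_quantity, total_quantity):
    """Share of a website's POI data items that were judged correct."""
    return first_quantity / total_quantity if total_quantity else 0.0

def decayed_confidence(accuracies, decay=0.5):
    """Weighted average of per-period accuracies, newest period first.

    The period that is k steps old gets weight decay ** k (an assumption).
    """
    weights = [decay ** k for k in range(len(accuracies))]
    total = sum(weights)
    if total == 0:
        return 0.0
    return sum(w * a for w, a in zip(weights, accuracies)) / total

print(round(accuracy(2, 3) * 100, 2))  # the "* * * A" example: 2 of 3 correct
```

Weighting recent periods more heavily lets a website that cleans up its data regain confidence, and lets one that degrades lose it, without waiting for the old history to be outnumbered.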

Of course, the above ways of calculating the confidence are only examples. When implementing embodiments of the present invention, other ways of calculating the confidence can be set according to the actual situation or adopted by those skilled in the art according to actual needs, and embodiments of the present invention place no restriction on this.

When the confidence is higher than a preset first threshold, the website is shown to be a trusted POI source, and extraction of POI data from the web pages of the website is allowed.

Embodiments of the present invention identify correct first target POI data among the POI data extracted from web pages, count the first quantity of the first target POI data belonging to the same website, and determine the confidence of the POI data in the website, so that the correct POI data can be applied in subsequent operations, which lowers the error rate of those operations and reduces wasted resources.

Furthermore, crawling POI data from trusted POI data sources is allowed according to the confidence, so the crawled POI data is more accurate, fewer computer system and bandwidth resources are wasted, and POI data crawling becomes more efficient.

Referring to Fig. 2, a flow chart of the steps of Embodiment 2 of a method for determining the confidence of POI data in a website according to an embodiment of the present invention is shown. The method may specifically include the following steps:

Step 201, extracting POI data from web pages;

For example, part of the web page structure of a certain website is as follows:

Wherein, "* * *" is the domain name.

Wherein, "* * * A" and "* * * B" are different domain names.

Step 202, identifying erroneous second target POI data among the POI data;

In embodiments of the present invention, erroneous second target POI data refers to data that does not meet the POI specification, including an erroneous name, address, and so on.

In an optional embodiment of the present invention, step 202 may include the following sub-steps:

Sub-step S21, grouping the POI names that identify the same object into a POI name set;

Sub-step S22, identifying an erroneous second target POI name in the POI name set;

In embodiments of the present invention, the erroneous POI name, i.e. the second target POI name, can be screened out by mining the keywords of the POI names.

In an optional embodiment of the present invention, sub-step S22 may further include the following sub-steps:

Sub-step S222, identifying the erroneous second target POI name among the POI names according to the keywords.

In a specific implementation, the second word frequency of each keyword in the POI name set can be calculated, and the POI names to which the Z keywords with the lowest second word frequency belong are determined to be erroneous second target POI names, where Z is a positive integer.

For example, for the keywords of the above POI names, the second word frequency of "world" is 1 and that of "safety" is 3. The second word frequency of "world" is lower, so "500 tops of the world enterprise", to which it belongs, can be confirmed as an erroneous second target POI name.
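The mirror image of the correct-name selection can be sketched as follows, again assuming one keyword per POI name; the function name and mapping layout are illustrative.

```python
from collections import Counter

def wrong_names(name_to_keyword, z=1):
    """Return names whose keyword is among the z least frequent in the set."""
    second_freq = Counter(name_to_keyword.values())
    lowest = sorted(second_freq, key=lambda kw: second_freq[kw])[:z]
    return [name for name, kw in name_to_keyword.items() if kw in lowest]

name_set = {
    "500 tops of the world enterprise": "world",
    "China Ping'an Insurance company": "safety",
    "Chinese safety Yulin branch company": "safety",
    "Yulin branch company, China Ping'an Insurance Co., Ltd. Branch": "safety",
}
print(wrong_names(name_set))
```

One caveat of this sketch: when every name in a set shares the same keyword, the least frequent keyword is also the most frequent one, so a real implementation would likely need an extra guard (for example a minimum gap between frequencies) before flagging anything.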

Sub-step S23, determining that the POI data to which the second target POI name belongs is erroneous second target POI data.

When the name of a POI is erroneous, the POI can be confirmed to be an erroneous POI.

Step 203, counting a second quantity of the second target POI data belonging to the same website;

In practical applications, the URL corresponding to each second target POI data item can be looked up; when the URLs belong to the domain name of the same website, the second quantity of second target POI data is counted.

For example, in the above example of POI data, the URLs of "500 tops of the world enterprise", "China Ping'an Insurance company" and "Chinese safety Yulin branch company" belong to the domain name "* * * A" of the same website, i.e. these POI names belong to the same website, and the second quantity of second target POI data of this website is 1.

Step 204, determining the confidence of the POI data in the website according to the second quantity.

In a specific implementation, the error rate, i.e. the ratio of the second quantity to the total quantity, can be calculated from the second quantity; for the above website with the domain name "* * * A", the error rate is 33.33%.
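The per-website error rate can be computed from two tallies, as in this small sketch. The dictionaries of per-site counts and the example domains are hypothetical inputs.

```python
def error_rates(total_by_site, wrong_by_site):
    """Error rate per website: erroneous POI items divided by all POI items."""
    return {
        site: wrong_by_site.get(site, 0) / total
        for site, total in total_by_site.items()
        if total > 0
    }

# The "* * * A" example: 3 POI items in total, 1 of them erroneous.
rates = error_rates({"a.example": 3, "b.example": 1}, {"a.example": 1})
print({site: round(rate * 100, 2) for site, rate in rates.items()})
```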

The confidence of the POI data in the website is determined according to the error rate; in this case, the confidence characterizes untrustworthiness.

In one example, the error rate can be assigned directly to the confidence;

In another example, weights that decay over time can be configured for the error rates of different time periods, and the confidence is calculated from the weighted error rates, for example by summation.

When the confidence is lower than a preset second threshold, the website is shown to be an untrusted POI source, and extraction of POI data from the web pages of the website is prohibited.

Referring to Fig. 3, a flow chart of the steps of Embodiment 3 of a method for determining the confidence of POI data in a website according to an embodiment of the present invention is shown. The method may specifically include the following steps:

Step 301, extracting POI data from web pages;

Step 302, identifying correct first target POI data and erroneous second target POI data among the POI data;

Step 303, counting a first quantity of the first target POI data and a second quantity of the second target POI data belonging to the same website;

Step 304, determining the confidence of the POI data in the website according to the first quantity and the second quantity.

In an optional embodiment of the present invention, the method may further include the following steps:

Step 305, when the confidence is higher than a preset first threshold, allowing extraction of POI data from the web pages of the website;

Step 306, when the confidence is lower than a preset second threshold, prohibiting extraction of POI data from the web pages of the website.
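Steps 304 to 306 can be sketched as a single decision function. Embodiment 3 does not state how the two quantities are combined, so the confidence used here (the share of correct items among all classified items) and both threshold values are assumptions chosen for illustration.

```python
def crawl_decision(first_quantity, second_quantity,
                   first_threshold=0.8, second_threshold=0.3):
    """Decide whether a website may be crawled for POI data.

    Assumed confidence: correct items / (correct items + erroneous items).
    Returns "allow", "forbid", or "undecided" when the confidence falls
    between the two thresholds or no item has been classified yet.
    """
    total = first_quantity + second_quantity
    if total == 0:
        return "undecided"
    confidence = first_quantity / total
    if confidence > first_threshold:
        return "allow"
    if confidence < second_threshold:
        return "forbid"
    return "undecided"

print(crawl_decision(2, 1))  # the "* * * A" example: 2 correct, 1 erroneous
```

Keeping two separate thresholds leaves a middle band in which a website is neither whitelisted nor blacklisted, so borderline sources can keep being sampled until more evidence accumulates.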

In an alternate embodiment of the present invention where, step 301 may include following sub-step:

Sub-step S31 searches the template for webpage configuration；

Sub-step S32, in the webpage, interest point data is extracted in the position according to template instruction.

In an alternate embodiment of the present invention where, the interest point data includes interest point name；Step 302 can wrap Include following sub-step:

The interest point name for identifying same target is set interest point name set by sub-step S41；

Sub-step S42 identifies correct first object interest point name and mistake from the interest point name set Second target interest point name；

Sub-step S43 determines that interest point data belonging to the first object interest point name is correct first object Interest point data；

Sub-step S44 determines that interest point data belonging to the second target interest point name is the second target of mistake Interest point data.

In an optional embodiment of the present invention, the POI data includes a POI address; sub-step S41 may further include the following sub-steps:

Sub-step S411, judge whether the POI addresses are the same or similar; if so, execute sub-step S412;

Sub-step S412, set the POI names corresponding to those POI addresses as a POI name set.
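A minimal sketch of sub-steps S411 and S412 is given below. How "same or similar" is measured is not specified here; this sketch substitutes the character-level similarity ratio from Python's `difflib.SequenceMatcher`, with a hypothetical threshold of 0.85. The function names and the record layout (`name`, `address` keys) are assumptions.

```python
from difflib import SequenceMatcher


def addresses_match(a, b, threshold=0.85):
    """Sub-step S411: treat two addresses as the same or similar when their
    similarity ratio reaches the (assumed) threshold."""
    return SequenceMatcher(None, a, b).ratio() >= threshold


def group_names_by_address(pois, threshold=0.85):
    """Sub-step S412: collect names whose addresses match into one name set."""
    groups = []  # each entry: (representative address, list of names)
    for poi in pois:
        for rep_address, names in groups:
            if addresses_match(rep_address, poi["address"], threshold):
                names.append(poi["name"])
                break
        else:
            # No existing group has a matching address: start a new one.
            groups.append((poi["address"], [poi["name"]]))
    return [names for _, names in groups]
```

Each returned list plays the role of one POI name set, that is, the names that different pages give to what is presumed to be the same object.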

In an optional embodiment of the present invention, sub-step S42 may further include the following sub-steps:

Sub-step S421, select keywords from the POI names in the POI name set;

Sub-step S422, identify, according to the keywords, a correct first target POI name and an erroneous second target POI name from the POI names.

In an optional embodiment of the present invention, sub-step S421 may further include the following sub-steps:

Sub-step S4211, perform word segmentation on the POI names in the POI name set to obtain one or more segmented words;

Sub-step S4212, look up a first word frequency of each segmented word in a preset POI set;

Sub-step S4213, take the X segmented words with the lowest first word frequency in the same POI name as the keywords of that POI name, where X is a positive integer.

In an optional embodiment of the present invention, sub-step S421 may further include the following sub-step:

Sub-step S4214, remove a segmented word when it matches preset address data.
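Sub-steps S4211 to S4214 can be illustrated with the sketch below. Several parts are assumptions: whitespace splitting stands in for a real word segmenter, `first_word_freq` is a hypothetical precomputed table of frequencies over the preset POI set, `address_terms` is a hypothetical set of preset address words, and a segmented word missing from the table is treated as having frequency 0.

```python
def select_keywords(name, first_word_freq, address_terms, x=1):
    """Pick the x rarest non-address words of a POI name as its keywords."""
    # S4211: word segmentation (whitespace split is only a stand-in).
    tokens = name.split()
    # S4214: drop segmented words that match preset address data.
    tokens = [t for t in tokens if t not in address_terms]
    # S4212: look up each word's first word frequency (unknown words -> 0).
    # S4213: keep the x words with the lowest first word frequency.
    ranked = sorted(tokens, key=lambda t: first_word_freq.get(t, 0))
    return ranked[:x]
```

The intuition is that generic words such as "Coffee" or "Shop" appear in many POI names, while the rarest word in a name is usually the one that distinguishes it.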

In an optional embodiment of the present invention, sub-step S422 may further include the following sub-steps:

Sub-step S4221, calculate a second word frequency of each keyword in the POI set;

Sub-step S4222, take the POI names to which the Y keywords with the highest second word frequency belong as correct first target POI names;

Sub-step S4223, take the POI names to which the Z keywords with the lowest second word frequency belong as erroneous second target POI names, where Y and Z are positive integers.
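Sub-steps S4221 to S4223 might look like the sketch below. It rests on assumptions: the second word frequency is taken to be the number of names in the supplied collection that carry a keyword, ties are broken by first appearance, and a name already classified as correct is never also classified as erroneous. The function name and the input layout are hypothetical.

```python
from collections import Counter


def classify_names(name_keywords, y=1, z=1):
    """Split POI names into (correct, erroneous) lists.

    name_keywords maps each POI name to the list of its keywords.
    """
    # S4221: second word frequency = how many names carry each keyword.
    freq = Counter(kw for kws in name_keywords.values() for kw in set(kws))
    ranked = sorted(freq, key=lambda kw: freq[kw], reverse=True)
    top, bottom = set(ranked[:y]), set(ranked[-z:])
    # S4222: names owning one of the y most frequent keywords are correct.
    correct = [n for n, kws in name_keywords.items() if top & set(kws)]
    # S4223: names owning one of the z least frequent keywords are erroneous
    # (assumption: a name already marked correct is not marked erroneous).
    wrong = [n for n, kws in name_keywords.items()
             if bottom & set(kws) and n not in correct]
    return correct, wrong
```

With three pages agreeing on one keyword and a fourth carrying a misspelled variant, the majority names land in the first list and the outlier in the second.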

In an optional embodiment of the present invention, the POI data includes a URL; step 303 may include the following sub-steps:

Sub-step S51, look up the URL corresponding to the first target POI data and the URL corresponding to the second target POI data;

Sub-step S52, when the URLs corresponding to the first target POI data belong to the domain name of the same website, count the first quantity of the first target POI data;

Sub-step S53, when the URLs corresponding to the second target POI data belong to the domain name of the same website, count the second quantity of the second target POI data.
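The per-website counting of sub-steps S51 to S53 can be sketched with the standard library. This sketch assumes that a website is identified by the host name of the URL; a production system would probably normalise to a registrable domain, which is not attempted here. The function name and the `url` key are hypothetical.

```python
from collections import Counter
from urllib.parse import urlparse


def count_by_site(correct_pois, wrong_pois):
    """Return (first_quantity, second_quantity) counters keyed by host name."""
    # S51: look up the URL of every record and reduce it to its host name.
    # S52: first quantity = number of correct records per website.
    first = Counter(urlparse(p["url"]).hostname for p in correct_pois)
    # S53: second quantity = number of erroneous records per website.
    second = Counter(urlparse(p["url"]).hostname for p in wrong_pois)
    return first, second
```

Because `Counter` returns 0 for a missing key, a website with no erroneous records can be queried without special handling.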

In an optional embodiment of the present invention, step 304 may include the following sub-steps:

Sub-step S61, calculate an accuracy rate according to the first quantity;

Sub-step S62, calculate an error rate according to the second quantity;

Sub-step S63, determine the confidence of the POI data in the website according to the accuracy rate and the error rate.
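Sub-steps S61 to S63 do not fix the formulas in this passage, so the sketch below is one possible reading rather than the specified computation. It assumes that both rates are taken over the total number of POI records extracted from the website and that the confidence is the accuracy rate minus the error rate. The function name and the `total` parameter are hypothetical.

```python
def site_confidence(first_quantity, second_quantity, total):
    """Combine per-site counts into a confidence value in [-1, 1]."""
    if total <= 0:
        raise ValueError("total must be positive")
    accuracy = first_quantity / total      # S61 (assumed denominator)
    error_rate = second_quantity / total   # S62 (assumed denominator)
    return accuracy - error_rate           # S63 (assumed combination)
```

Under these assumptions, a website whose records are all correct scores 1, and one whose records are all erroneous scores -1.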

Since this embodiment is substantially similar in application to method Embodiments 1 and 2, it is described relatively briefly; for related details, refer to the descriptions of method Embodiments 1 and 2, which are not repeated here.

For simplicity of description, the method embodiments are expressed as a series of combined actions, but those skilled in the art should know that embodiments of the present invention are not limited by the described order of actions, because according to embodiments of the present invention some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by embodiments of the present invention.

Referring to Fig. 4, a structural block diagram of Embodiment 1 of a device for determining the confidence of POI data in a website according to an embodiment of the present invention is shown. The device may specifically include the following modules:

A POI data extraction module 401, adapted to extract POI data from a webpage;

A correct POI data identification module 402, adapted to identify correct first target POI data from the POI data;

A correct quantity statistics module 403, adapted to count a first quantity of the first target POI data belonging to the same website;

A trusted confidence determination module 404, adapted to determine the confidence of the POI data in the website according to the first quantity.

In an optional embodiment of the present invention, the device may further include the following module:

An extraction permitting module, adapted to allow POI data to be extracted from webpages of the website when the confidence is higher than a preset first threshold.

In an optional embodiment of the present invention, the POI data extraction module 401 may further be adapted to:

Look up the template configured for the webpage;

In an optional embodiment of the present invention, the POI data includes a POI name;

The correct POI data identification module 402 may further be adapted to:

Identify a correct first target POI name from the POI name set;

Determine that the POI data to which the first target POI name belongs is correct first target POI data.

In an optional embodiment of the present invention, the POI data includes a POI address;

The correct POI data identification module 402 may further be adapted to:

In an optional embodiment of the present invention, the correct POI data identification module 402 may further be adapted to:

Select keywords from the POI names in the POI name set;

Identify, according to the keywords, a correct first target POI name from the POI names.

Look up a first word frequency of the segmented words in a preset POI set;

Calculate a second word frequency of the keywords in the POI set;

Take the POI names to which the Y keywords with the highest second word frequency belong as correct first target POI names, where Y is a positive integer.

In an optional embodiment of the present invention, the POI data includes a URL;

The correct quantity statistics module 403 may further be adapted to:

Look up the URL corresponding to the first target POI data;

When the URLs corresponding to the first target POI data belong to the domain name of the same website, count the first quantity of the first target POI data.

In an optional embodiment of the present invention, the trusted confidence determination module 404 may further be adapted to:

Calculate an accuracy rate according to the first quantity;

Determine the confidence of the POI data in the website according to the accuracy rate.

Referring to Fig. 5, a structural block diagram of Embodiment 2 of a device for determining the confidence of POI data in a website according to an embodiment of the present invention is shown. The device may specifically include the following modules:

A POI data extraction module 501, adapted to extract POI data from a webpage;

An erroneous POI data identification module 502, adapted to identify erroneous second target POI data from the POI data;

An erroneous quantity statistics module 503, adapted to count a second quantity of the second target POI data belonging to the same website;

An untrusted confidence determination module 504, adapted to determine the confidence of the POI data in the website according to the second quantity.

In an optional embodiment of the present invention, the POI data extraction module 501 may further be adapted to:

Look up the template configured for the webpage;

The erroneous POI data identification module 502 may further be adapted to:

In an optional embodiment of the present invention, the erroneous POI data identification module 502 may further be adapted to:

Select keywords from the POI names in the POI name set;

Look up a first word frequency of the segmented words in a preset POI set;

Calculate a second word frequency of the keywords in the POI set;

The erroneous quantity statistics module 503 may further be adapted to:

Look up the URL corresponding to the second target POI data;

In an optional embodiment of the present invention, the untrusted confidence determination module 504 may further be adapted to:

Calculate an error rate according to the second quantity;

Referring to Fig. 6, a structural block diagram of Embodiment 3 of a device for determining the confidence of POI data in a website according to an embodiment of the present invention is shown. The device may specifically include the following modules:

A POI data extraction module 601, adapted to extract POI data from a webpage;

A POI data identification module 602, adapted to identify correct first target POI data and erroneous second target POI data from the POI data;

A quantity statistics module 603, adapted to count a first quantity of the first target POI data and a second quantity of the second target POI data belonging to the same website;

A confidence determination module 604, adapted to determine the confidence of the POI data in the website according to the first quantity and the second quantity.

An extraction permitting module, adapted to allow POI data to be extracted from webpages of the website when the confidence is higher than a preset first threshold;

In an optional embodiment of the present invention, the POI data extraction module 601 may further be adapted to:

Look up the template configured for the webpage;

The POI data identification module 602 may further be adapted to:

Identify a correct first target POI name and an erroneous second target POI name from the POI name set;

Determine that the POI data to which the first target POI name belongs is correct first target POI data;

The POI data identification module 602 may further be adapted to:

In an optional embodiment of the present invention, the POI data identification module 602 may further be adapted to:

Select keywords from the POI names in the POI name set;

Identify, according to the keywords, a correct first target POI name and an erroneous second target POI name from the POI names.

Look up a first word frequency of the segmented words in a preset POI set;

Calculate a second word frequency of the keywords in the POI set;

Take the POI names to which the Y keywords with the highest second word frequency belong as correct first target POI names;

Take the POI names to which the Z keywords with the lowest second word frequency belong as erroneous second target POI names, where Y and Z are positive integers.

The quantity statistics module 603 may further be adapted to:

Look up the URL corresponding to the first target POI data and the URL corresponding to the second target POI data;

When the URLs corresponding to the first target POI data belong to the domain name of the same website, count the first quantity of the first target POI data;

In an optional embodiment of the present invention, the confidence determination module 604 may further be adapted to:

Calculate an accuracy rate according to the first quantity;

Calculate an error rate according to the second quantity;

Determine the confidence of the POI data in the website according to the accuracy rate and the error rate.
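One way to picture how modules 601 to 604 cooperate is as four pluggable functions chained in order. The sketch below is illustrative only: the class name `ConfidenceDevice`, its field names, and the idea of passing each module as a callable are hypothetical and do not come from the specification, and it handles the pages of a single website for brevity.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConfidenceDevice:
    extract: Callable    # stands in for POI data extraction module 601
    identify: Callable   # stands in for POI data identification module 602
    count: Callable      # stands in for quantity statistics module 603
    determine: Callable  # stands in for confidence determination module 604

    def run(self, pages):
        """Chain the four modules over the pages of one website."""
        pois = [poi for page in pages for poi in self.extract(page)]
        correct, wrong = self.identify(pois)
        first_quantity, second_quantity = self.count(correct, wrong)
        return self.determine(first_quantity, second_quantity)
```

Treating each module as a replaceable callable mirrors the statement that modules may be combined or divided freely across devices.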

As the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for related details, refer to the description of the method embodiments.

The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the invention described herein may be implemented in a variety of programming languages, and that the description of a specific language above is given to disclose the best mode of the invention.

In the specification provided here, numerous specific details are set forth. It is to be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this specification.

Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single foregoing disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will understand that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.

In addition, those skilled in the art will understand that although some embodiments described herein include certain features included in other embodiments rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any one of the claimed embodiments may be used in any combination.

The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for determining the confidence of POI data in a website according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.

Claims

1. A method for determining the confidence of point-of-interest (POI) data in a website, comprising:

extracting POI data from a webpage;

identifying erroneous second target POI data from the POI data;

counting a second quantity of the second target POI data belonging to the same website;

determining the confidence of the POI data in the website according to the second quantity;

wherein the POI data comprises a URL;

the step of counting the second quantity of the second target POI data belonging to the same website comprises:

looking up the URL corresponding to the second target POI data;

when the URL corresponding to the second target POI data belongs to the domain name of the same website, counting the second quantity of the second target POI data;

wherein the step of determining the confidence of the POI data in the website according to the second quantity comprises:

calculating an error rate according to the second quantity;

determining the confidence of the POI data in the website according to the error rate.

2. The method of claim 1, further comprising:

when the confidence is lower than a preset second threshold, prohibiting extraction of POI data from webpages of the website.

3. The method of claim 1, wherein the step of extracting POI data from a webpage comprises:

looking up the template configured for the webpage;

extracting, in the webpage, the POI data at the position indicated by the template.

4. The method of any one of claims 1-3, wherein the POI data comprises a POI name;

the step of identifying erroneous second target POI data from the POI data comprises:

setting the POI names that identify the same object as a POI name set;

identifying an erroneous second target POI name from the POI name set;

determining that the POI data to which the second target POI name belongs is erroneous second target POI data.

5. The method of claim 4, wherein the POI data comprises a POI address;

the step of setting the POI names that identify the same object as a POI name set comprises:

determining whether the POI addresses are the same or similar; if so, setting the POI names corresponding to the POI addresses as a POI name set.

6. The method of claim 4, wherein the step of identifying an erroneous second target POI name from the POI name set comprises:

selecting keywords from the POI names in the POI name set;

identifying, according to the keywords, an erroneous second target POI name from the POI names.

7. The method of claim 6, wherein the step of selecting keywords from the POI names in the POI name set comprises:

performing word segmentation on the POI names in the POI name set to obtain one or more segmented words;

looking up a first word frequency of each segmented word in a preset POI set;

taking the X segmented words with the lowest first word frequency in the same POI name as the keywords of that POI name, where X is a positive integer.

8. The method of claim 7, wherein the step of selecting keywords from the POI names in the POI name set further comprises:

removing a segmented word when it matches preset address data.

9. The method of claim 6, wherein the step of identifying, according to the keywords, an erroneous second target POI name from the POI names comprises:

calculating a second word frequency of each keyword in the POI set;

taking the POI names to which the Z keywords with the lowest second word frequency belong as erroneous second target POI names, where Z is a positive integer.

10. A device for determining the confidence of point-of-interest (POI) data in a website, comprising:

a POI data extraction module, adapted to extract POI data from a webpage;

an erroneous POI data identification module, adapted to identify erroneous second target POI data from the POI data;

an erroneous quantity statistics module, adapted to count a second quantity of the second target POI data belonging to the same website;

an untrusted confidence determination module, adapted to determine the confidence of the POI data in the website according to the second quantity;

wherein the POI data comprises a URL;

the erroneous quantity statistics module is further adapted to:

look up the URL corresponding to the second target POI data;

wherein the untrusted confidence determination module is further adapted to:

calculate an error rate according to the second quantity;

11. The device of claim 10, further comprising:

an extraction prohibition module, adapted to prohibit extraction of POI data from webpages of the website when the confidence is lower than a preset second threshold.

12. The device of claim 10, wherein the POI data extraction module is further adapted to:

look up the template configured for the webpage;

13. The device of any one of claims 10-12, wherein the POI data comprises a POI name;

the erroneous POI data identification module is further adapted to:

set the POI names that identify the same object as a POI name set;

identify an erroneous second target POI name from the POI name set;

14. The device of claim 13, wherein the POI data comprises a POI address;

the erroneous POI data identification module is further adapted to:

15. The device of claim 13, wherein the erroneous POI data identification module is further adapted to:

select keywords from the POI names in the POI name set;

16. The device of claim 15, wherein the erroneous POI data identification module is further adapted to:

17. The device of claim 16, wherein the erroneous POI data identification module is further adapted to:

18. The device of claim 15, wherein the erroneous POI data identification module is further adapted to: