
CN105160032B - Method and device for determining the confidence level of point-of-interest data in a website - Google Patents


Info

Publication number
CN105160032B
Authority
CN
China
Prior art keywords
poi
data
interest point
interest
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510642636.0A
Other languages
Chinese (zh)
Other versions
CN105160032A (en
Inventor
王智广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201510642636.0A priority Critical patent/CN105160032B/en
Publication of CN105160032A publication Critical patent/CN105160032A/en
Application granted granted Critical
Publication of CN105160032B publication Critical patent/CN105160032B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/29 Geographical information databases
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Remote Sensing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a method and a device for determining the confidence level of point-of-interest (POI) data in a website. The method includes: extracting POI data from webpages; identifying wrong second target POI data from the POI data; counting a second quantity of the second target POI data belonging to the same website; and determining the confidence level of the POI data in the website according to the second quantity. Based on the confidence level, embodiments of the present invention forbid crawling POI data from untrusted POI data sources, so the crawled POI data is highly accurate, less of the computer's system resources and bandwidth is wasted, and POI data crawling becomes more efficient.

Description

Method and device for determining the confidence level of point-of-interest data in a website
Technical field
The present invention relates to the technical field of computer processing, and more particularly to a method for determining the confidence level of point-of-interest data in a website and a device for determining the confidence level of point-of-interest data in a website.
Background art
A point of interest (POI), also called an "information point", carries various kinds of information, such as a name, a category, a longitude and a latitude.
In a geographic information system, a POI can be a house, a shop, a mailbox, a bus stop, and so on.
Traditional geographic information collection requires surveyors to measure the longitude and latitude of a point of interest with precision surveying instruments and then annotate it.
Because collecting POI data is so time-consuming and laborious, the number of POIs in a geographic information system represents, to a certain extent, the value of the whole system.
To enlarge the amount of POI data in a geographic information system, POI data is currently mined from webpages, mostly by configuring a suitable template according to the structure of a webpage and extracting data with that template.
However, users do not necessarily publish information according to a webpage's rules, so websites containing POIs are filled with a large amount of dirty data, that is, wrong POI data.
For example, some websites designate a region of the webpage for publishing a company name, but some users may publish text such as "World's top 500 enterprises" there, which is not a real POI name.
If such wrong POI data is later used for operations such as navigation, the error rate of those operations is high and resources are wasted.
Moreover, the computer keeps crawling this wrong POI data, which wastes the computer's system resources and bandwidth and makes POI data crawling very inefficient.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method for determining the confidence level of POI data in a website, and a corresponding device, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for determining the confidence level of POI data in a website is provided, including:
Extracting POI data from webpages;
Identifying wrong second target POI data from the POI data;
Counting a second quantity of the second target POI data belonging to the same website;
Determining the confidence level of the POI data in the website according to the second quantity.
Optionally, the method further includes:
When the confidence level is lower than a preset second threshold, forbidding extraction of POI data from the webpages of the website.
Optionally, the step of extracting POI data from webpages includes:
Looking up the template configured for the webpage;
Extracting POI data from the webpage at the positions indicated by the template.
Optionally, the POI data includes a POI name;
The step of identifying wrong second target POI data from the POI data includes:
Grouping the POI names that identify the same object into a POI name set;
Identifying a wrong second target POI name from the POI name set;
Determining that the POI data to which the second target POI name belongs is wrong second target POI data.
Optionally, the POI data includes a POI address;
The step of grouping the POI names that identify the same object into a POI name set includes:
Judging whether the POI addresses are identical or similar; if so, grouping the POI names associated with those POI addresses into a POI name set.
Optionally, the step of identifying a wrong second target POI name from the POI name set includes:
Choosing keywords from the POI names in the POI name set;
Identifying the wrong second target POI name from the POI names according to the keywords.
Optionally, the step of choosing keywords from the POI names in the POI name set includes:
Performing word segmentation on the POI names in the POI name set to obtain one or more segmented words;
Looking up the first word frequency of each segmented word in a preset POI set;
Taking the X segmented words with the lowest first word frequency in the same POI name as the keywords of that POI name, where X is a positive integer.
Optionally, the step of choosing keywords from the POI names in the POI name set further includes:
When a segmented word matches preset address data, removing that segmented word.
Optionally, the step of identifying the wrong second target POI name from the POI names according to the keywords includes:
Calculating the second word frequency of each keyword in the POI set;
Taking the POI names to which the Z keywords with the lowest second word frequency belong as the wrong second target POI names, where Z is a positive integer.
Optionally, the POI data includes a URL;
The step of counting the second quantity of the second target POI data belonging to the same website includes:
Looking up the URL corresponding to the second target POI data;
When the URLs corresponding to the second target POI data belong to the domain name of the same website, counting the second quantity of the second target POI data.
Optionally, the step of determining the confidence level of the POI data in the website according to the second quantity includes:
Calculating an error rate according to the second quantity;
Determining the confidence level of the POI data in the website according to the error rate.
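The last two optional steps, together with the second-threshold rule, can be sketched in Python as follows. This is only an illustration under stated assumptions: the text does not say how the error rate maps to a confidence level, so the sketch assumes the error rate is the wrong count divided by the total count and that confidence is its complement. The function names are hypothetical.

```python
def error_rate(second_quantity: int, total_quantity: int) -> float:
    """Share of a website's extracted POI records that were flagged as wrong."""
    if total_quantity <= 0:
        raise ValueError("total_quantity must be positive")
    return second_quantity / total_quantity


def confidence_from_error_rate(rate: float) -> float:
    # Assumption: confidence is simply the complement of the error rate.
    return 1.0 - rate


def extraction_allowed(confidence: float, second_threshold: float) -> bool:
    # Extraction is forbidden when confidence is lower than the threshold.
    return confidence >= second_threshold
```

For a website with 1 wrong record out of 3, this gives an error rate of about 0.33 and a confidence of about 0.67; whether extraction continues then depends on the chosen threshold.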
According to another aspect of the present invention, a device for determining the confidence level of POI data in a website is provided, including:
A POI data extraction module, adapted to extract POI data from webpages;
A wrong POI data identification module, adapted to identify wrong second target POI data from the POI data;
An error quantity counting module, adapted to count a second quantity of the second target POI data belonging to the same website;
An untrusted-confidence determination module, adapted to determine the confidence level of the POI data in the website according to the second quantity.
Optionally, the device further includes:
An extraction forbidding module, adapted to forbid extraction of POI data from the webpages of the website when the confidence level is lower than a preset second threshold.
Optionally, the POI data extraction module is further adapted to:
Look up the template configured for the webpage;
Extract POI data from the webpage at the positions indicated by the template.
Optionally, the POI data includes a POI name;
The wrong POI data identification module is further adapted to:
Group the POI names that identify the same object into a POI name set;
Identify a wrong second target POI name from the POI name set;
Determine that the POI data to which the second target POI name belongs is wrong second target POI data.
Optionally, the POI data includes a POI address;
The wrong POI data identification module is further adapted to:
Judge whether the POI addresses are identical or similar; if so, group the POI names associated with those POI addresses into a POI name set.
Optionally, the wrong POI data identification module is further adapted to:
Choose keywords from the POI names in the POI name set;
Identify the wrong second target POI name from the POI names according to the keywords.
Optionally, the wrong POI data identification module is further adapted to:
Perform word segmentation on the POI names in the POI name set to obtain one or more segmented words;
Look up the first word frequency of each segmented word in a preset POI set;
Take the X segmented words with the lowest first word frequency in the same POI name as the keywords of that POI name, where X is a positive integer.
Optionally, the wrong POI data identification module is further adapted to:
Remove a segmented word when it matches preset address data.
Optionally, the wrong POI data identification module is further adapted to:
Calculate the second word frequency of each keyword in the POI set;
Take the POI names to which the Z keywords with the lowest second word frequency belong as the wrong second target POI names, where Z is a positive integer.
Optionally, the POI data includes a URL;
The error quantity counting module is further adapted to:
Look up the URL corresponding to the second target POI data;
Count the second quantity of the second target POI data when the URLs corresponding to the second target POI data belong to the domain name of the same website.
Optionally, the untrusted-confidence determination module is further adapted to:
Calculate an error rate according to the second quantity;
Determine the confidence level of the POI data in the website according to the error rate.
Embodiments of the present invention identify wrong second target POI data from the POI data extracted from webpages, count the second quantity of the second target POI data belonging to the same website, and determine the confidence level of the POI data in the website, so that the wrong POI data can be rejected in subsequent operations, which lowers the error rate of those operations and reduces wasted resources.
Furthermore, crawling POI data from untrusted POI data sources is forbidden according to the confidence level, so the crawled POI data is highly accurate, less of the computer's system resources and bandwidth is wasted, and POI data crawling becomes more efficient.
The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented according to the contents of the specification, and so that the above and other objects, features and advantages of the present invention are more readily apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the present invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 shows a flow chart of the steps of embodiment 1 of a method for determining the confidence level of POI data in a website according to an embodiment of the present invention;
Fig. 2 shows a flow chart of the steps of embodiment 2 of a method for determining the confidence level of POI data in a website according to an embodiment of the present invention;
Fig. 3 shows a flow chart of the steps of embodiment 3 of a method for determining the confidence level of POI data in a website according to an embodiment of the present invention;
Fig. 4 shows a structural block diagram of embodiment 1 of a device for determining the confidence level of POI data in a website according to an embodiment of the present invention;
Fig. 5 shows a structural block diagram of embodiment 2 of a device for determining the confidence level of POI data in a website according to an embodiment of the present invention; and
Fig. 6 shows a structural block diagram of embodiment 3 of a device for determining the confidence level of POI data in a website according to an embodiment of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
Referring to Fig. 1, a flow chart of the steps of embodiment 1 of a method for determining the confidence level of POI data in a website according to an embodiment of the present invention is shown. The method may specifically include the following steps:
Step 101, extracting POI data from webpages;
In embodiments of the present invention, a crawler can crawl internet webpages in advance by following the links between webpages and save them; the crawled webpages are stored in a webpage database, forming a large pool of search resources.
For webpages that contain a lot of POI data distributed in a regular pattern, such as webpages on websites where users review restaurants and travel destinations, or webpages on map websites, the template configured for the webpage can be looked up, and POI data can be extracted from the webpage at the positions indicated by the template. A large amount of POI data is thereby obtained, including associated POI names, POI addresses, URLs (Uniform Resource Locators), and so on.
For example, part of the webpage structure of some website is as follows:
Here, "***" is a domain name.
In the template for this website, the POI name can be extracted from the first line and the POI address can be extracted from the last line.
Using templates, the following POI data is extracted from the webpages of different websites:
Here, "***A" and "***B" are different domain names.
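Template-based extraction as described above can be sketched in Python as follows. This is a minimal sketch, not the patent's implementation: the `Template` class, the `TEMPLATES` table, the domain `site-a.example` and the representation of a page as a list of text lines are all assumptions made for illustration. The only detail taken from the text is that the name sits on the first line and the address on the last line.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class Template:
    name_line: int     # index of the line that holds the POI name
    address_line: int  # index of the line that holds the POI address


# Hypothetical template table keyed by domain name.
TEMPLATES: Dict[str, Template] = {
    "site-a.example": Template(name_line=0, address_line=-1),
}


def extract_poi(url: str, page_lines: List[str]) -> Optional[Dict[str, str]]:
    """Return one POI record, or None when no template is configured."""
    template = TEMPLATES.get(urlparse(url).netloc)
    if template is None or not page_lines:
        return None
    return {
        "name": page_lines[template.name_line].strip(),
        "address": page_lines[template.address_line].strip(),
        "url": url,  # kept so the record can later be attributed to a website
    }
```

Keeping the URL in each record matters for the later counting step, which attributes records to websites by domain name.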
Step 102, identifying correct first target POI data from the POI data;
In embodiments of the present invention, correct first target POI data refers to data that conforms to the POI specification, including a correct name, a correct address, and so on.
In an optional embodiment of the present invention, step 102 may include the following sub-steps:
Sub-step S11, grouping the POI names that identify the same object into a POI name set;
A POI record generally identifies one object, such as a house, a shop, a mailbox, a bus stop, and so on.
Since the address information of an object is generally fairly accurate, in embodiments of the present invention the POI addresses can be normalized to judge whether they are identical or similar; if so, the POI names associated with those POI addresses are grouped into a POI name set.
For example, consider "three building, the permanent general merchandise in Yulin road Yu Yangfushi tide today hotel next door east", "Yulin Yuyang District skin Shi Lujin Diurnal tide next door east three buildings the first sales departments of permanent general merchandise", "3 building, the permanent department store in Yulin south gate Yu Yang mouthful east" and "Yulin south Wholesale three buildings of the permanent general merchandise in doorway east". Although these four POI addresses are not identical in form, normalization can determine that they all refer to the address "three building, the permanent department store in rate in Yuyang county east".
That is, the associated names "World's top 500 enterprises", "China Ping'an Insurance Company", "China Ping'an Yulin Branch" and "China Ping'an Insurance Co., Ltd. Yulin Branch" form a POI name set.
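Grouping names by normalized address can be sketched in Python as follows. The text does not specify how addresses are normalized or how "similar" addresses are detected, so `normalize_address` below is only a placeholder assumption (lowercasing and dropping whitespace and punctuation), and only addresses that become identical after normalization are grouped. A real system would need a much richer address parser.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List


def normalize_address(address: str) -> str:
    # Placeholder normalization: lowercase, drop whitespace and punctuation.
    return re.sub(r"[\W_]+", "", address.lower())


def group_names_by_address(records: Iterable[Dict[str, str]]) -> Dict[str, List[str]]:
    """Map each normalized address to the POI names published under it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for record in records:
        groups[normalize_address(record["address"])].append(record["name"])
    return dict(groups)
```

Each value in the returned mapping plays the role of one POI name set.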
Sub-step S12, identifying a correct first target POI name from the POI name set;
In embodiments of the present invention, correct POI names, that is, first target POI names, can be screened out by mining the keywords of the POI names.
In an optional embodiment of the present invention, sub-step S12 may further include the following sub-steps:
Sub-step S121, choosing keywords from the POI names in the POI name set;
In embodiments of the present invention, a keyword can be the word that carries the most information and reflects the distinguishing feature of the POI name.
In a specific implementation, word segmentation can be performed on the POI names in the POI name set to obtain one or more segmented words;
The first word frequency of each segmented word in a preset POI set is then looked up. The POI set is the set of POI data in the crawled webpages; it can contain tens of millions of POI records, and the first word frequency is counted over the names of those tens of millions of POI records.
In embodiments of the present invention, one or more of the following word segmentation methods can be used:
1. Segmentation based on string matching: the Chinese character string to be analyzed is matched, according to a certain strategy, against the entries of a preset machine dictionary; if a string is found in the dictionary, the match succeeds (a word is identified).
2. Segmentation based on feature scanning or mark cutting: words with obvious features are identified and cut out of the string first; using these words as breakpoints, the original string is divided into smaller strings that are then passed to mechanical segmentation, which reduces the matching error rate. Alternatively, segmentation is combined with part-of-speech tagging: rich part-of-speech information helps segmentation decisions, and the segmentation result is in turn checked and adjusted during tagging, which improves segmentation accuracy.
3. Segmentation based on understanding: the computer simulates human understanding of a sentence in order to identify words. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use syntactic and semantic information to resolve ambiguity. Such a system generally has three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Coordinated by the master control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to resolve segmentation ambiguity; that is, it simulates the human process of understanding a sentence. This method requires a large amount of linguistic knowledge and information.
4. Segmentation based on statistics: in Chinese text, the frequency or probability with which characters co-occur adjacently reflects fairly well how likely they are to form a word. The frequency of each combination of adjacent co-occurring characters in a corpus can therefore be counted to compute their mutual information, that is, the adjacent co-occurrence probability of two Chinese characters X and Y. Mutual information reflects how tightly the characters are bound together; when the tightness exceeds a certain threshold, the character group can be considered to possibly form a word.
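Of the four methods, string-matching segmentation is the simplest to illustrate. The Python sketch below uses forward maximum matching, which is one common matching strategy; the text does not say which strategy is intended, so this choice, the tiny dictionary and the `max_len` parameter are assumptions for illustration.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Greedily take the longest dictionary entry starting at each position."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Illustrative dictionary; a real machine dictionary would be far larger.
words = {"中国", "平安", "保险", "公司"}
print(forward_max_match("中国平安保险公司", words))
```

Characters that match no dictionary entry fall through as single-character tokens, which is the usual behavior of this strategy.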
For example, the POI names above can be segmented as follows:
The lower a segmented word's first word frequency, the more information it generally carries, so the X segmented words with the lowest first word frequency in the same POI name can be taken as the keywords of that POI name, where X is a positive integer.
For example, the following keywords can be extracted from the POI names above:
POI name | Keyword
World's top 500 enterprises | world
China Ping'an Insurance Company | Ping'an
China Ping'an Yulin Branch | Ping'an
China Ping'an Insurance Co., Ltd. Yulin Branch | Ping'an
Here, words such as "enterprise", "company" and "branch" have a high first word frequency and carry little information; they only indicate an enterprise or company identity, point to nothing specific, and are unsuitable as keywords. Words such as "Ping'an" have a lower first word frequency and carry more information, being the commonly used short name of an enterprise, and are suitable as keywords.
Note that address data such as the country's provinces, cities, counties (districts), townships and roads can be obtained in advance to create an address database.
When a segmented word matches preset address data, for example "China" or "Yulin", it is an invalid keyword and can be removed.
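Keyword selection, including the removal of address words, can be sketched in Python as follows. The frequency table and address list are made-up illustrative values, not figures from the patent, and treating a word missing from the table as frequency 0 is an assumption.

```python
from typing import Dict, List, Set


def choose_keywords(
    tokens: List[str],
    first_word_freq: Dict[str, int],
    address_words: Set[str],
    x: int = 1,
) -> List[str]:
    """Return the x rarest segmented words that are not address words."""
    candidates = [t for t in tokens if t not in address_words]
    # Assumption: a word absent from the frequency table counts as frequency 0.
    candidates.sort(key=lambda t: first_word_freq.get(t, 0))
    return candidates[:x]


# Made-up frequencies over a hypothetical POI set.
freq = {"China": 900000, "Ping'an": 1200, "Insurance": 50000, "Company": 800000}
addresses = {"China", "Yulin"}
print(choose_keywords(["China", "Ping'an", "Insurance", "Company"], freq, addresses))
```

Because Python's sort is stable, words with equal frequency keep their original order in the name.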
Sub-step S122, identifying the correct first target POI name from the POI names according to the keywords.
In a specific implementation, the second word frequency of each keyword in the POI name set can be calculated, and the POI names to which the Y keywords with the highest second word frequency belong are determined to be correct first target POI names, where Y is a positive integer.
For example, for the keywords of the POI names above, the second word frequency of "world" is 1 and the second word frequency of "Ping'an" is 3. Since the second word frequency of "Ping'an" is higher, the names it belongs to, "China Ping'an Insurance Company", "China Ping'an Yulin Branch" and "China Ping'an Insurance Co., Ltd. Yulin Branch", can be confirmed as correct first target POI names.
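The selection by second word frequency can be sketched in Python as follows. The sketch assumes each POI name has exactly one keyword (X = 1) and counts how often each keyword occurs within the name set; the function name and the input mapping are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def correct_names(name_to_keyword: Dict[str, str], y: int = 1) -> List[str]:
    """Names whose keyword is among the y most frequent keywords in the set."""
    second_word_freq = Counter(name_to_keyword.values())
    top_keywords = {kw for kw, _ in second_word_freq.most_common(y)}
    return [name for name, kw in name_to_keyword.items() if kw in top_keywords]


name_set = {
    "World's top 500 enterprises": "world",
    "China Ping'an Insurance Company": "Ping'an",
    "China Ping'an Yulin Branch": "Ping'an",
    "China Ping'an Insurance Co., Ltd. Yulin Branch": "Ping'an",
}
print(correct_names(name_set))
```

With the example name set, "Ping'an" occurs 3 times and "world" once, so the three Ping'an names are kept.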
Sub-step S13, determining that the POI data to which the first target POI name belongs is correct first target POI data.
When the name of a POI is correct, the POI can be confirmed to be a correct POI.
Step 103, counting a first quantity of the first target POI data belonging to the same website;
In practical applications, the URL corresponding to the first target POI data can be looked up; when the URLs corresponding to the first target POI data belong to the domain name of the same website, the first quantity of the first target POI data is counted.
For example, in the POI data example above, the URLs of "World's top 500 enterprises", "China Ping'an Insurance Company" and "China Ping'an Yulin Branch" belong to the domain name "***A" of the same website, that is, these POI names belong to the same website. Of these three, two are correct first target POI names, so the first quantity of the first target POI data of this website is 2.
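Counting records per website by the domain name of their URL can be sketched in Python as follows. The sketch uses `urllib.parse.urlparse` from the standard library and takes the `netloc` part as the domain name; the domain `site-a.example` and the record layout are illustrative assumptions.

```python
from collections import Counter
from typing import Dict, Iterable
from urllib.parse import urlparse


def count_by_domain(records: Iterable[Dict[str, str]]) -> Counter:
    """Count POI records per website, keyed by the domain of each record's URL."""
    return Counter(urlparse(record["url"]).netloc for record in records)


# Hypothetical records already identified as correct first target POI data.
first_target = [
    {"name": "China Ping'an Insurance Company", "url": "http://site-a.example/poi/2"},
    {"name": "China Ping'an Yulin Branch", "url": "http://site-a.example/poi/3"},
]
print(count_by_domain(first_target)["site-a.example"])
```

Running the same function over all extracted records gives the total quantity per website, which is the denominator needed for the accuracy rate.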
Step 104, determining the confidence level of the POI data in the website according to the first quantity.
In a specific implementation, an accuracy rate, that is, the ratio of the first quantity to the total quantity, can be calculated from the first quantity; for the website above whose domain name is "***A", the accuracy rate is 2/3, or 66.67%.
The confidence level of the POI data in the website is determined according to the accuracy rate; here, the confidence level characterizes how trustworthy the website's POI data is.
In one example, the accuracy rate can be assigned directly to the confidence level;
In another example, weights can be configured for the accuracy rates of different time periods, with the weight decaying over time, and the confidence level is calculated from the weighted accuracy rates, for instance by summation.
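Both examples can be sketched in Python as follows. The text only says that weights decay over time and that the weighted accuracy rates are combined by summation or similar, so the exponential decay factor and the normalization by the sum of weights are assumptions chosen to keep the result between 0 and 1.

```python
from typing import Sequence


def accuracy(first_quantity: int, total_quantity: int) -> float:
    """Ratio of correct POI records to all POI records extracted from a website."""
    if total_quantity <= 0:
        raise ValueError("total_quantity must be positive")
    return first_quantity / total_quantity


def decayed_confidence(accuracies: Sequence[float], decay: float = 0.5) -> float:
    """Weighted average of per-period accuracy rates, newest period first.

    Assumption: the weight of the k-th most recent period is decay ** k.
    """
    if not accuracies:
        raise ValueError("at least one period is required")
    weights = [decay ** k for k in range(len(accuracies))]
    weighted_sum = sum(w * a for w, a in zip(weights, accuracies))
    return weighted_sum / sum(weights)


print(round(accuracy(2, 3) * 100, 2))  # accuracy of the example website, in percent
```

With a single period, `decayed_confidence` reduces to the first example, where the accuracy rate is used directly as the confidence level.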
Of course, the above ways of calculating the confidence level are only examples. When implementing embodiments of the present invention, other ways of calculating the confidence level can be set according to the actual situation, and those skilled in the art can also adopt other calculation methods according to actual needs; embodiments of the present invention place no limitation on this.
When the confidence level is higher than a preset first threshold, the website is shown to be a trusted POI source, and POI data is allowed to be extracted from the webpages of the website.
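The first-threshold rule can be applied as a filter on the crawl queue, sketched in Python below. The function name, the example domains and the choice to treat a website with no recorded confidence as untrusted are assumptions for illustration.

```python
from typing import Dict, Iterable, List
from urllib.parse import urlparse


def allowed_urls(
    urls: Iterable[str],
    confidence_by_domain: Dict[str, float],
    first_threshold: float,
) -> List[str]:
    """Keep only URLs whose website has a confidence above the threshold."""
    return [
        url
        for url in urls
        # Assumption: a website with no recorded confidence is not trusted.
        if confidence_by_domain.get(urlparse(url).netloc, 0.0) > first_threshold
    ]


confidence = {"site-a.example": 0.6667, "site-b.example": 0.2}
queue = ["http://site-a.example/poi/7", "http://site-b.example/poi/8"]
print(allowed_urls(queue, confidence, first_threshold=0.5))
```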
Embodiments of the present invention identify correct first target POI data from the POI data extracted from webpages, count the first quantity of the first target POI data belonging to the same website, and determine the confidence level of the POI data in the website, so that the correct POI data can be used in subsequent operations, which lowers the error rate of those operations and reduces wasted resources.
Furthermore, crawling POI data from trusted POI data sources is allowed according to the confidence level, so the crawled POI data is highly accurate, less of the computer's system resources and bandwidth is wasted, and POI data crawling becomes more efficient.
Referring to Fig. 2, a flow chart of the steps of embodiment 2 of a method for determining the confidence level of POI data in a website according to an embodiment of the present invention is shown. The method may specifically include the following steps:
Step 201, extracting POI data from webpages;
In embodiments of the present invention, a crawler can crawl internet webpages in advance by following the links between webpages and save them; the crawled webpages are stored in a webpage database, forming a large pool of search resources.
For webpages that contain a lot of POI data distributed in a regular pattern, such as webpages on websites where users review restaurants and travel destinations, or webpages on map websites, the template configured for the webpage can be looked up, and POI data can be extracted from the webpage at the positions indicated by the template. A large amount of POI data is thereby obtained, including associated POI names, POI addresses, URLs (Uniform Resource Locators), and so on.
For example, part of the webpage structure of some website is as follows:
Here, "***" is a domain name.
In the template for this website, the POI name can be extracted from the first line and the POI address can be extracted from the last line.
Using templates, the following POI data is extracted from the webpages of different websites:
Here, "***A" and "***B" are different domain names.
Step 202: identify erroneous second target interest point data from the interest point data;
In embodiments of the present invention, erroneous second target interest point data refers to data that does not conform to the interest point specification, including an erroneous name, address, and so on.
In an optional embodiment of the present invention, step 202 may include the following sub-steps:
Sub-step S21: group the interest point names that identify the same object into an interest point name set;
A piece of POI data generally identifies one object, such as a house, a shop, a mailbox, or a bus station.
Since the address information of an object is generally quite accurate, in embodiments of the present invention the interest point addresses can be normalized to judge whether they are the same or similar; if so, the interest point names associated with those addresses are grouped into an interest point name set.
For example, consider "three building, the permanent general merchandise in Yulin road Yu Yangfushi tide today hotel next door east", "Yulin Yuyang District skin Shi Lujin Diurnal tide next door east three buildings the first sales departments of permanent general merchandise", "3 building, the permanent department store in Yulin south gate Yu Yang mouthful east" and "Yulin south Wholesale three buildings of the permanent general merchandise in doorway east". Although these 4 interest point addresses are not identical in form, normalization can determine that they all refer to the same address, "three building, the permanent department store in rate in Yuyang county east".
That is, the associated names "500 tops of the world enterprise", "China Ping'an Insurance company", "Chinese safety Yulin branch company" and "Yulin branch company, China Ping'an Insurance Co., Ltd. Branch" form an interest point name set.
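A minimal sketch of grouping names by normalized address follows. The text does not say how normalization is performed, so the normalizer is passed in by the caller; `toy_normalize` is a deliberately naive stand-in and both function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple


def group_names_by_address(
    records: Iterable[Tuple[str, str]],
    normalize: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Group interest point names whose normalized addresses are equal.

    records   -- (name, address) pairs extracted from web pages
    normalize -- hypothetical address normalizer supplied by the caller
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, address in records:
        groups[normalize(address)].append(name)
    return dict(groups)


# Toy normalizer for illustration only: lowercase, drop spaces and commas.
def toy_normalize(address: str) -> str:
    return "".join(ch for ch in address.lower() if ch not in " ,")
```

Each value of the returned dictionary plays the role of one interest point name set. Detecting merely similar addresses, rather than equal ones, would need a similarity measure that this sketch does not attempt.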
Sub-step S22: identify the erroneous second target interest point name from the interest point name set;
In embodiments of the present invention, erroneous POI names, i.e. second target interest point names, can be screened out by mining the keywords of the interest point names.
In an optional embodiment of the present invention, sub-step S22 may further include the following sub-steps:
Sub-step S221: select keywords from the interest point names in the interest point name set;
In embodiments of the present invention, a keyword may be the word that carries the most information and best reflects the characteristics of the interest point name.
In a specific implementation, word segmentation can be performed on the interest point names in the interest point name set to obtain one or more segmented words;
The first word frequency of each segmented word in a preset interest point set is then looked up. The interest point set is the set of POI data in the crawled web pages, which may contain tens of millions of POI records; the first word frequency is computed over the names of these records.
In embodiments of the present invention, one or more of the following word segmentation methods may be used:
1. Segmentation based on string matching: the Chinese character string to be analyzed is matched, according to a certain strategy, against the entries of a preset machine dictionary; if a string is found in the dictionary, the match succeeds (a word is identified).
2. Segmentation based on feature scanning or mark cutting: words with obvious features are first identified and cut out of the string to be analyzed; using these words as breakpoints, the original string is divided into smaller strings that are then passed to mechanical segmentation, which reduces the matching error rate. Alternatively, segmentation is combined with part-of-speech tagging: rich part-of-speech information assists the segmentation decisions, and the segmentation results are in turn checked and adjusted during tagging, which improves segmentation accuracy.
3. Segmentation based on understanding: the computer simulates a human's understanding of the sentence in order to identify words. The basic idea is to perform syntactic and semantic analysis while segmenting, and to use syntactic and semantic information to resolve ambiguity. It generally includes three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Under the coordination of the master control part, the segmentation subsystem obtains syntactic and semantic information about words and sentences to resolve segmentation ambiguity; that is, it simulates the process by which a person understands a sentence. This method requires a large amount of linguistic knowledge and information.
4. Segmentation based on statistics: in Chinese text, the frequency or probability with which adjacent characters co-occur reflects fairly well how likely they are to form a word. The frequency of each combination of adjacent co-occurring characters in a corpus can therefore be counted, and their mutual information computed from the adjacent co-occurrence probability of two Chinese characters X and Y. Mutual information reflects how tightly the characters are bound together; when it exceeds a certain threshold, the character group can be considered likely to form a word.
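As an illustration of the first method (string matching), the sketch below uses forward maximum matching, one common matching strategy. The text does not name a particular strategy, so this choice, the function name, and the `max_len` parameter are assumptions.

```python
from typing import List, Set


def forward_max_match(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Segment text by repeatedly taking the longest dictionary entry that
    starts at the current position; unknown characters are emitted alone."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a match is found.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens
```

Falling back to single characters guarantees that the loop always advances, even when the dictionary is empty.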
For example, the above interest point names can be segmented as follows:
The lower the first word frequency of a segmented word, the more information it generally carries. The X segmented words with the lowest first word frequency in the same interest point name can therefore be used as the keywords of that interest point name, where X is a positive integer.
For example, the following keywords can be extracted for the above interest point names:
Interest point name | Keyword
500 tops of the world enterprise | The world
China Ping'an Insurance company | Safety
Chinese safety Yulin branch company | Safety
Yulin branch company, China Ping'an Insurance Co., Ltd. Branch | Safety
The first word frequency of words such as "enterprise", "company" and "branch company" is high and the information they carry is small: they only indicate that the object is an enterprise or company, without pointing to a specific one, so they are not suitable as keywords. The first word frequency of words such as "safety" (the "Ping'an" in the company name) is lower and the information they carry is greater: it is the commonly used short name of the enterprise, so it is suitable as a keyword.
It should be noted that address data for the whole country, such as provinces, cities, counties (districts), townships and roads, can be obtained in advance to create an address database.
When a segmented word matches the preset address data, for example "China" or "Yulin", it is an invalid keyword and can be removed.
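Keyword selection as described above (drop address words, then keep the X rarest segmented words) can be sketched as follows. The function name is hypothetical, and treating words absent from the frequency table as frequency 0 is an assumption the text does not address.

```python
from typing import Dict, Iterable, List, Set


def select_keywords(
    tokens: Iterable[str],
    first_word_freq: Dict[str, int],
    address_terms: Set[str],
    x: int = 1,
) -> List[str]:
    """Return the x tokens with the lowest first word frequency,
    after removing tokens that match the address database."""
    # dict.fromkeys removes duplicate tokens while keeping their order.
    candidates = [t for t in dict.fromkeys(tokens) if t not in address_terms]
    # Tokens missing from the frequency table count as frequency 0 (assumption).
    candidates.sort(key=lambda t: first_word_freq.get(t, 0))
    return candidates[:x]
```

With made-up frequencies in which "company" is very common and "Ping'an" is rare, and with "China" listed as an address term, the call returns the rare word as the single keyword.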
Sub-step S222: identify the erroneous second target interest point name among the interest point names according to the keywords.
In a specific implementation, the second word frequency of each keyword within the interest point name set can be calculated, and the interest point names to which the Z keywords with the lowest second word frequency belong are determined to be erroneous second target interest point names, where Z is a positive integer.
For example, for the keywords of the above interest point names, the second word frequency of "world" is 1 and that of "safety" is 3. Since the second word frequency of "world" is lower, the name "500 tops of the world enterprise" to which it belongs can be confirmed as an erroneous second target interest point name.
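The second-word-frequency check can be sketched as below, using the four names and keywords from the example. The function name and the dictionary-based input are hypothetical, and each name is assumed to have exactly one keyword (X = 1).

```python
from collections import Counter
from typing import Dict, List


def find_wrong_names(keyword_of: Dict[str, str], z: int = 1) -> List[str]:
    """keyword_of maps each interest point name in one name set to its keyword.
    Returns the names whose keywords are among the z least frequent in the set."""
    second_freq = Counter(keyword_of.values())
    # Sort distinct keywords by ascending second word frequency and keep z.
    lowest = [kw for kw, _ in sorted(second_freq.items(), key=lambda kv: kv[1])[:z]]
    return [name for name, kw in keyword_of.items() if kw in lowest]
```

Note that when every name in a set shares the same keyword, this sketch would flag all of them; a practical implementation would presumably need a guard for that case, which the text does not describe.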
Sub-step S23: determine that the interest point data to which the second target interest point name belongs is erroneous second target interest point data.
When the name of a POI is wrong, the POI can be confirmed to be an erroneous POI.
Step 203: count the second quantity of second target interest point data belonging to the same website;
In practical applications, the URL corresponding to each piece of second target interest point data can be looked up; the pieces of second target interest point data whose URLs belong to the domain name of the same website are counted to obtain the second quantity.
For example, in the interest point data example above, the URLs of "500 tops of the world enterprise", "China Ping'an Insurance company" and "Chinese safety Yulin branch company" belong to the domain name "* * * A" of the same website, i.e. these interest point names belong to the same website, and the second quantity of second target interest point data of this website is 1.
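Counting records per website from their URLs can be sketched as follows. The domains in the source are elided, so `a.example` and `b.example` in the usage note are placeholders; using the full host name as the website key (rather than a registered domain) is also an assumption.

```python
from collections import Counter
from typing import Iterable
from urllib.parse import urlparse


def count_by_site(urls: Iterable[str]) -> Counter:
    """Count interest point records per website, keyed by the URL's host name."""
    return Counter(urlparse(u).hostname for u in urls)
```

Calling `count_by_site` once on the URLs of the erroneous records and once on the URLs of all records yields, per site, the second quantity and the total quantity that the following step needs.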
Step 204: determine the confidence level of the interest point data in the website according to the second quantity.
In a specific implementation, an error rate can be calculated from the second quantity, i.e. the ratio of the second quantity to the total quantity. For the website above whose domain name is "* * * A", the error rate is 1/3, or 33.33%.
The confidence level of the interest point data in the website is determined according to the error rate; in this case the confidence level characterizes unreliability.
In one example, the error rate can be assigned directly to the confidence level;
In another example, weights that decay over time can be configured for the error rates of different time periods, and the confidence level is calculated from the weighted error rates, for example by summation.
Of course, the above ways of calculating the confidence level are only examples. When implementing the embodiments of the present invention, those skilled in the art may use other ways of calculating the confidence level according to actual needs, and the embodiments of the present invention are not limited in this respect.
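The two examples above can be sketched as follows. The text only says that the weights decay over time and that the weighted rates are combined by summation or similar; the geometric decay `decay**k` and the normalization of the weights are assumptions made for this illustration, as are the function names.

```python
from typing import Sequence


def error_rate(second_quantity: int, total_quantity: int) -> float:
    """Ratio of erroneous interest point records to all records of a website."""
    return second_quantity / total_quantity if total_quantity else 0.0


def decayed_score(rates: Sequence[float], decay: float = 0.5) -> float:
    """Weighted sum of per-period rates, newest period first.

    A period that is k periods old gets weight decay**k (an assumed scheme);
    the weights are normalized so that they sum to 1.
    """
    weights = [decay ** k for k in range(len(rates))]
    total = sum(weights)
    return sum(w * r for w, r in zip(weights, rates)) / total if total else 0.0
```

For the example website, `error_rate(1, 3)` gives one third, which corresponds to the 33.33% stated in the text.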
When the confidence level is lower than a preset second threshold, the website is an untrusted POI source, and extracting interest point data from the web pages of the website is forbidden.
The embodiment of the present invention identifies erroneous second target interest point data among the interest point data extracted from web pages, counts the second quantity of second target interest point data belonging to the same website, and determines the confidence level of the interest point data in that website, so that this erroneous POI data can be rejected in subsequent operations, reducing the error rate of those operations and the waste of resources.
In turn, crawling POI data from these untrusted POI data sources is forbidden according to the confidence level, so that the POI data that is crawled is highly accurate, the waste of computer system resources and bandwidth is reduced, and the efficiency of POI data crawling is improved.
Referring to Fig. 3, a flow chart of the steps of embodiment 3 of a method for determining the confidence level of interest point data in a website according to an embodiment of the present invention is shown. The method may specifically include the following steps:
Step 301: extract interest point data from web pages;
Step 302: identify correct first target interest point data and erroneous second target interest point data from the interest point data;
Step 303: count the first quantity of first target interest point data and the second quantity of second target interest point data belonging to the same website;
Step 304: determine the confidence level of the interest point data in the website according to the first quantity and the second quantity.
In an optional embodiment of the present invention, the method may further include the following steps:
Step 305: when the confidence level is higher than a preset first threshold, allow interest point data to be extracted from the web pages of the website;
Step 306: when the confidence level is lower than a preset second threshold, forbid interest point data from being extracted from the web pages of the website.
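Steps 305 and 306 amount to a two-threshold gate, sketched below. The text does not say what happens when the confidence level lies between the two thresholds, so the "undecided" outcome and the function name are assumptions; the sketch also assumes the first threshold is not smaller than the second.

```python
def crawl_decision(confidence: float, first_threshold: float, second_threshold: float) -> str:
    """Return 'allow', 'forbid', or 'undecided' for a website."""
    if confidence > first_threshold:
        return "allow"       # step 305: trusted source
    if confidence < second_threshold:
        return "forbid"      # step 306: untrusted source
    return "undecided"       # behaviour between the thresholds is not specified
```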
In an optional embodiment of the present invention, step 301 may include the following sub-steps:
Sub-step S31: look up a template configured for the web page;
Sub-step S32: extract interest point data from the web page at the positions indicated by the template.
In an optional embodiment of the present invention, the interest point data includes an interest point name; step 302 may include the following sub-steps:
Sub-step S41: group the interest point names that identify the same object into an interest point name set;
Sub-step S42: identify the correct first target interest point name and the erroneous second target interest point name from the interest point name set;
Sub-step S43: determine that the interest point data to which the first target interest point name belongs is correct first target interest point data;
Sub-step S44: determine that the interest point data to which the second target interest point name belongs is erroneous second target interest point data.
In an optional embodiment of the present invention, the interest point data includes an interest point address; sub-step S41 may further include the following sub-steps:
Sub-step S411: judge whether the interest point addresses are the same or similar; if so, execute sub-step S412;
Sub-step S412: group the interest point names associated with those interest point addresses into an interest point name set.
In an optional embodiment of the present invention, sub-step S42 may further include the following sub-steps:
Sub-step S421: select keywords from the interest point names in the interest point name set;
Sub-step S422: identify the correct first target interest point name and the erroneous second target interest point name among the interest point names according to the keywords.
In an optional embodiment of the present invention, sub-step S421 may further include the following sub-steps:
Sub-step S4211: perform word segmentation on the interest point names in the interest point name set to obtain one or more segmented words;
Sub-step S4212: look up the first word frequency of each segmented word in a preset interest point set;
Sub-step S4213: take the X segmented words with the lowest first word frequency in the same interest point name as the keywords of that interest point name, where X is a positive integer.
In an optional embodiment of the present invention, sub-step S421 may further include the following sub-step:
Sub-step S4214: remove a segmented word when it matches the preset address data.
In an optional embodiment of the present invention, sub-step S422 may further include the following sub-steps:
Sub-step S4221: calculate the second word frequency of each keyword in the interest point name set;
Sub-step S4222: take the interest point names to which the Y keywords with the highest second word frequency belong as correct first target interest point names;
Sub-step S4223: take the interest point names to which the Z keywords with the lowest second word frequency belong as erroneous second target interest point names, where Y and Z are positive integers.
In an optional embodiment of the present invention, the interest point data includes a URL; step 303 may include the following sub-steps:
Sub-step S51: look up the URL corresponding to the first target interest point data and the URL corresponding to the second target interest point data;
Sub-step S52: count the first target interest point data whose URLs belong to the domain name of the same website to obtain the first quantity;
Sub-step S53: count the second target interest point data whose URLs belong to the domain name of the same website to obtain the second quantity.
In an optional embodiment of the present invention, step 304 may include the following sub-steps:
Sub-step S61: calculate an accuracy rate according to the first quantity;
Sub-step S62: calculate an error rate according to the second quantity;
Sub-step S63: determine the confidence level of the interest point data in the website according to the accuracy rate and the error rate.
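Sub-steps S61 to S63 can be sketched as below. The text does not specify how the accuracy rate and the error rate are combined, so taking their difference is purely an illustrative choice, and the function name is hypothetical.

```python
def confidence_from_counts(first_quantity: int, second_quantity: int, total_quantity: int) -> float:
    """Combine accuracy rate and error rate into one score in [-1, 1].

    The difference used here is an illustrative choice, not taken from the text.
    """
    if total_quantity <= 0:
        return 0.0
    accuracy = first_quantity / total_quantity   # sub-step S61
    error = second_quantity / total_quantity     # sub-step S62
    return accuracy - error                      # sub-step S63 (assumed combination)
```

A site with 2 correct and 1 erroneous record out of 3 would score one third under this choice; other combinations, such as a weighted sum, would fit the text equally well.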
Since this embodiment is substantially similar in application to method embodiments 1 and 2, its description is relatively brief; for related details, refer to the corresponding parts of method embodiments 1 and 2, which are not repeated here.
For simplicity of description, the method embodiments are expressed as a series of action combinations, but those skilled in the art should understand that the embodiments of the present invention are not limited by the described order of actions, since according to the embodiments of the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions involved are not necessarily required by the embodiments of the present invention.
Referring to Fig. 4, a structural block diagram of embodiment 1 of a device for determining the confidence level of interest point data in a website according to an embodiment of the present invention is shown. The device may specifically include the following modules:
An interest point data extraction module 401, adapted to extract interest point data from web pages;
A correct interest point data identification module 402, adapted to identify correct first target interest point data from the interest point data;
A correct quantity statistics module 403, adapted to count the first quantity of first target interest point data belonging to the same website;
A credible confidence determination module 404, adapted to determine the confidence level of the interest point data in the website according to the first quantity.
In an optional embodiment of the present invention, the device may further include the following module:
An allow-extraction module, adapted to allow interest point data to be extracted from the web pages of the website when the confidence level is higher than a preset first threshold.
In an optional embodiment of the present invention, the interest point data extraction module 401 may be further adapted to:
Look up a template configured for the web page;
Extract interest point data from the web page at the positions indicated by the template.
In an optional embodiment of the present invention, the interest point data includes an interest point name;
The correct interest point data identification module 402 may be further adapted to:
Group the interest point names that identify the same object into an interest point name set;
Identify the correct first target interest point name from the interest point name set;
Determine that the interest point data to which the first target interest point name belongs is correct first target interest point data.
In an optional embodiment of the present invention, the interest point data includes an interest point address;
The correct interest point data identification module 402 may be further adapted to:
Judge whether the interest point addresses are the same or similar; if so, group the interest point names associated with those interest point addresses into an interest point name set.
In an optional embodiment of the present invention, the correct interest point data identification module 402 may be further adapted to:
Select keywords from the interest point names in the interest point name set;
Identify the correct first target interest point name among the interest point names according to the keywords.
In an optional embodiment of the present invention, the correct interest point data identification module 402 may be further adapted to:
Perform word segmentation on the interest point names in the interest point name set to obtain one or more segmented words;
Look up the first word frequency of each segmented word in a preset interest point set;
Take the X segmented words with the lowest first word frequency in the same interest point name as the keywords of that interest point name, where X is a positive integer.
In an optional embodiment of the present invention, the correct interest point data identification module 402 may be further adapted to:
Remove a segmented word when it matches the preset address data.
In an optional embodiment of the present invention, the correct interest point data identification module 402 may be further adapted to:
Calculate the second word frequency of each keyword in the interest point name set;
Take the interest point names to which the Y keywords with the highest second word frequency belong as correct first target interest point names, where Y is a positive integer.
In an optional embodiment of the present invention, the interest point data includes a URL;
The correct quantity statistics module 403 may be further adapted to:
Look up the URL corresponding to the first target interest point data;
Count the first target interest point data whose URLs belong to the domain name of the same website to obtain the first quantity.
In an optional embodiment of the present invention, the credible confidence determination module 404 may be further adapted to:
Calculate an accuracy rate according to the first quantity;
Determine the confidence level of the interest point data in the website according to the accuracy rate.
Referring to Fig. 5, a structural block diagram of embodiment 2 of a device for determining the confidence level of interest point data in a website according to an embodiment of the present invention is shown. The device may specifically include the following modules:
An interest point data extraction module 501, adapted to extract interest point data from web pages;
An erroneous interest point data identification module 502, adapted to identify erroneous second target interest point data from the interest point data;
An error quantity statistics module 503, adapted to count the second quantity of second target interest point data belonging to the same website;
An untrusted confidence determination module 504, adapted to determine the confidence level of the interest point data in the website according to the second quantity.
In an optional embodiment of the present invention, the device may further include the following module:
A forbid-extraction module, adapted to forbid interest point data from being extracted from the web pages of the website when the confidence level is lower than a preset second threshold.
In an optional embodiment of the present invention, the interest point data extraction module 501 may be further adapted to:
Look up a template configured for the web page;
Extract interest point data from the web page at the positions indicated by the template.
In an optional embodiment of the present invention, the interest point data includes an interest point name;
The erroneous interest point data identification module 502 may be further adapted to:
Group the interest point names that identify the same object into an interest point name set;
Identify the erroneous second target interest point name from the interest point name set;
Determine that the interest point data to which the second target interest point name belongs is erroneous second target interest point data.
In an optional embodiment of the present invention, the interest point data includes an interest point address;
The erroneous interest point data identification module 502 may be further adapted to:
Judge whether the interest point addresses are the same or similar; if so, group the interest point names associated with those interest point addresses into an interest point name set.
In an optional embodiment of the present invention, the erroneous interest point data identification module 502 may be further adapted to:
Select keywords from the interest point names in the interest point name set;
Identify the erroneous second target interest point name among the interest point names according to the keywords.
In an optional embodiment of the present invention, the erroneous interest point data identification module 502 may be further adapted to:
Perform word segmentation on the interest point names in the interest point name set to obtain one or more segmented words;
Look up the first word frequency of each segmented word in a preset interest point set;
Take the X segmented words with the lowest first word frequency in the same interest point name as the keywords of that interest point name, where X is a positive integer.
In an optional embodiment of the present invention, the erroneous interest point data identification module 502 may be further adapted to:
Remove a segmented word when it matches the preset address data.
In an optional embodiment of the present invention, the erroneous interest point data identification module 502 may be further adapted to:
Calculate the second word frequency of each keyword in the interest point name set;
Take the interest point names to which the Z keywords with the lowest second word frequency belong as erroneous second target interest point names, where Z is a positive integer.
In an optional embodiment of the present invention, the interest point data includes a URL;
The error quantity statistics module 503 may be further adapted to:
Look up the URL corresponding to the second target interest point data;
Count the second target interest point data whose URLs belong to the domain name of the same website to obtain the second quantity.
In an optional embodiment of the present invention, the untrusted confidence determination module 504 may be further adapted to:
Calculate an error rate according to the second quantity;
Determine the confidence level of the interest point data in the website according to the error rate.
Referring to Fig. 6, a structural block diagram of embodiment 3 of a device for determining the confidence level of interest point data in a website according to an embodiment of the present invention is shown. The device may specifically include the following modules:
An interest point data extraction module 601, adapted to extract interest point data from web pages;
An interest point data identification module 602, adapted to identify correct first target interest point data and erroneous second target interest point data from the interest point data;
A quantity statistics module 603, adapted to count the first quantity of first target interest point data and the second quantity of second target interest point data belonging to the same website;
A confidence determination module 604, adapted to determine the confidence level of the interest point data in the website according to the first quantity and the second quantity.
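The four-module structure can be pictured as a simple composition of callables, as in the sketch below. This is only one way to wire the modules together: the `ConfidenceDevice` class, the `Record` type, and the single-page `run` method are hypothetical simplifications, and per-website grouping is left to the counting callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]  # hypothetical record, e.g. with 'name', 'address', 'url'


@dataclass
class ConfidenceDevice:
    extract: Callable[[str], List[Record]]                                   # module 601
    identify: Callable[[List[Record]], Tuple[List[Record], List[Record]]]    # module 602
    count: Callable[[List[Record], List[Record]], Tuple[int, int]]           # module 603
    determine: Callable[[int, int], float]                                   # module 604

    def run(self, page: str) -> float:
        """Pass one page through the four modules in order."""
        records = self.extract(page)
        correct, wrong = self.identify(records)
        first_quantity, second_quantity = self.count(correct, wrong)
        return self.determine(first_quantity, second_quantity)
```

Keeping each module behind a plain callable makes it easy to swap in, for example, a different identification strategy without touching the rest of the pipeline.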
In an optional embodiment of the present invention, the device may further include the following modules:
An allow-extraction module, adapted to allow interest point data to be extracted from the web pages of the website when the confidence level is higher than a preset first threshold;
A forbid-extraction module, adapted to forbid interest point data from being extracted from the web pages of the website when the confidence level is lower than a preset second threshold.
In an optional embodiment of the present invention, the interest point data extraction module 601 may be further adapted to:
Look up a template configured for the web page;
Extract interest point data from the web page at the positions indicated by the template.
In an optional embodiment of the present invention, the interest point data includes an interest point name;
The interest point data identification module 602 may be further adapted to:
Group the interest point names that identify the same object into an interest point name set;
Identify the correct first target interest point name and the erroneous second target interest point name from the interest point name set;
Determine that the interest point data to which the first target interest point name belongs is correct first target interest point data;
Determine that the interest point data to which the second target interest point name belongs is erroneous second target interest point data.
In an optional embodiment of the present invention, the interest point data includes an interest point address;
The interest point data identification module 602 may be further adapted to:
Judge whether the interest point addresses are the same or similar; if so, group the interest point names associated with those interest point addresses into an interest point name set.
In an optional embodiment of the present invention, the interest point data identification module 602 may be further adapted to:
Select keywords from the interest point names in the interest point name set;
Identify the correct first target interest point name and the erroneous second target interest point name among the interest point names according to the keywords.
In an optional embodiment of the present invention, the interest point data identification module 602 may be further adapted to:
Perform word segmentation on the interest point names in the interest point name set to obtain one or more segmented words;
Look up the first word frequency of each segmented word in a preset interest point set;
Take the X segmented words with the lowest first word frequency in the same interest point name as the keywords of that interest point name, where X is a positive integer.
In an optional embodiment of the present invention, the interest point data identification module 602 may be further adapted to:
Remove a segmented word when it matches the preset address data.
In an optional embodiment of the present invention, the interest point data identification module 602 may be further adapted to:
Calculate the second word frequency of each keyword in the interest point name set;
Take the interest point names to which the Y keywords with the highest second word frequency belong as correct first target interest point names;
Take the interest point names to which the Z keywords with the lowest second word frequency belong as erroneous second target interest point names, where Y and Z are positive integers.
In an optional embodiment of the present invention, the interest point data includes a URL;
The quantity statistics module 603 may be further adapted to:
Look up the URL corresponding to the first target interest point data and the URL corresponding to the second target interest point data;
Count the first target interest point data whose URLs belong to the domain name of the same website to obtain the first quantity;
Count the second target interest point data whose URLs belong to the domain name of the same website to obtain the second quantity.
In an optional embodiment of the present invention, the confidence determination module 604 may be further adapted to:
Calculate an accuracy rate according to the first quantity;
Calculate an error rate according to the second quantity;
Determine the confidence level of the interest point data in the website according to the accuracy rate and the error rate.
For device embodiment, since it is basically similar to the method embodiment, related so being described relatively simple Place illustrates referring to the part of embodiment of the method.
Algorithm and display are not inherently related to any particular computer, virtual system, or other device provided herein. Various general-purpose systems can also be used together with teachings based herein.As described above, it constructs required by this kind of system Structure be obvious.In addition, the present invention is also not directed to any particular programming language.It should be understood that can use various Programming language realizes summary of the invention described herein, and the description done above to language-specific is to disclose this hair Bright preferred forms.
In the instructions provided here, numerous specific details are set forth.It is to be appreciated, however, that implementation of the invention Example can be practiced without these specific details.In some instances, well known method, structure is not been shown in detail And technology, so as not to obscure the understanding of this specification.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, is not to be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not other features, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for determining the confidence level of POI data in a website according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.

Claims (18)

1. A method for determining the confidence level of point-of-interest (POI) data in a website, comprising:
extracting POI data from web pages;
identifying erroneous second target POI data from the POI data;
counting a second quantity of the second target POI data belonging to the same website;
determining the confidence level of the POI data in the website according to the second quantity;
wherein the POI data comprises a URL;
the step of counting the second quantity of the second target POI data belonging to the same website comprises:
looking up the URL corresponding to the second target POI data;
when the URL corresponding to the second target POI data belongs to the domain name of the same website, counting the second quantity of the second target POI data;
wherein the step of determining the confidence level of the POI data in the website according to the second quantity comprises:
calculating an error rate according to the second quantity;
determining the confidence level of the POI data in the website according to the accuracy rate.

2. The method of claim 1, further comprising:
when the confidence level is lower than a preset second threshold, prohibiting the extraction of POI data from the web pages of the website.

3. The method of claim 1, wherein the step of extracting POI data from web pages comprises:
looking up a template configured for the web page;
extracting, in the web page, the POI data at the position indicated by the template.

4. The method of any one of claims 1-3, wherein the POI data comprises a POI name;
the step of identifying erroneous second target POI data from the POI data comprises:
setting the POI names that identify the same object as a POI name set;
identifying an erroneous second target POI name from the POI name set;
determining that the POI data to which the second target POI name belongs is erroneous second target POI data.

5. The method of claim 4, wherein the POI data comprises a POI address;
the step of setting the POI names that identify the same object as a POI name set comprises:
determining whether the POI addresses are the same or similar; if so, setting the POI names associated with the POI addresses as a POI name set.

6. The method of claim 4, wherein the step of identifying an erroneous second target POI name from the POI name set comprises:
selecting keywords from the POI names in the POI name set;
identifying the erroneous second target POI name from the POI names according to the keywords.

7. The method of claim 6, wherein the step of selecting keywords from the POI names in the POI name set comprises:
performing word segmentation on the POI names in the POI name set to obtain one or more segmented words;
looking up a first word frequency of each segmented word in a preset POI set;
taking the X segmented words with the lowest first word frequency in the same POI name as the keywords of that POI name, where X is a positive integer.

8. The method of claim 7, wherein the step of selecting keywords from the POI names in the POI name set further comprises:
when a segmented word matches preset address data, removing the segmented word.

9. The method of claim 6, wherein the step of identifying the erroneous second target POI name from the POI names according to the keywords comprises:
calculating a second word frequency of each keyword in the POI set;
taking the POI names to which the Z keywords with the lowest second word frequency belong as the erroneous second target POI names, where Z is a positive integer.

10. A device for determining the confidence level of POI data in a website, comprising:
a POI data extraction module, adapted to extract POI data from web pages;
an erroneous POI data identification module, adapted to identify erroneous second target POI data from the POI data;
an error quantity statistics module, adapted to count a second quantity of the second target POI data belonging to the same website;
an untrusted confidence determination module, adapted to determine the confidence level of the POI data in the website according to the second quantity;
wherein the POI data comprises a URL;
the error quantity statistics module is further adapted to:
look up the URL corresponding to the second target POI data;
when the URL corresponding to the second target POI data belongs to the domain name of the same website, count the second quantity of the second target POI data;
wherein the untrusted confidence determination module is further adapted to:
calculate an error rate according to the second quantity;
determine the confidence level of the POI data in the website according to the accuracy rate.

11. The device of claim 10, further comprising:
an extraction prohibition module, adapted to prohibit the extraction of POI data from the web pages of the website when the confidence level is lower than a preset second threshold.

12. The device of claim 10, wherein the POI data extraction module is further adapted to:
look up a template configured for the web page;
extract, in the web page, the POI data at the position indicated by the template.

13. The device of any one of claims 10-12, wherein the POI data comprises a POI name;
the erroneous POI data identification module is further adapted to:
set the POI names that identify the same object as a POI name set;
identify an erroneous second target POI name from the POI name set;
determine that the POI data to which the second target POI name belongs is erroneous second target POI data.

14. The device of claim 13, wherein the POI data comprises a POI address;
the erroneous POI data identification module is further adapted to:
determine whether the POI addresses are the same or similar; if so, set the POI names associated with the POI addresses as a POI name set.

15. The device of claim 13, wherein the erroneous POI data identification module is further adapted to:
select keywords from the POI names in the POI name set;
identify the erroneous second target POI name from the POI names according to the keywords.

16. The device of claim 15, wherein the erroneous POI data identification module is further adapted to:
perform word segmentation on the POI names in the POI name set to obtain one or more segmented words;
look up a first word frequency of each segmented word in a preset POI set;
take the X segmented words with the lowest first word frequency in the same POI name as the keywords of that POI name, where X is a positive integer.

17. The device of claim 16, wherein the erroneous POI data identification module is further adapted to:
when a segmented word matches preset address data, remove the segmented word.

18. The device of claim 15, wherein the erroneous POI data identification module is further adapted to:
calculate a second word frequency of each keyword in the POI set;
take the POI names to which the Z keywords with the lowest second word frequency belong as the erroneous second target POI names, where Z is a positive integer.
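The keyword-based identification of erroneous POI names in claims 6 to 9 can be sketched roughly as below. This is an illustration under stated assumptions, not the claimed implementation: names are assumed to be already segmented into tokens (the claims do not name a segmenter), `corpus_freq` stands in for word frequencies in the preset POI set, `address_words` stands in for the preset address data, and the "second word frequency" is read here as the frequency of a keyword among the names of one name set. The function names `select_keywords` and `find_suspect_names` are hypothetical.

```python
from collections import Counter


def select_keywords(tokens, corpus_freq, address_words, x=1):
    """Pick the x tokens with the lowest first word frequency as keywords.

    Tokens that match the preset address data are removed first (claim 8).
    sorted() is stable, so ties keep their original order in the name.
    """
    candidates = [t for t in tokens if t not in address_words]
    return sorted(candidates, key=lambda t: corpus_freq.get(t, 0))[:x]


def find_suspect_names(name_set, corpus_freq, address_words, x=1, z=1):
    """Return the names whose keywords are among the z rarest in the name set."""
    keywords_by_name = {
        name: select_keywords(tokens, corpus_freq, address_words, x)
        for name, tokens in name_set.items()
    }
    # Second word frequency: how often each keyword occurs across the name set.
    second_freq = Counter(k for kws in keywords_by_name.values() for k in kws)
    rarest = set(sorted(second_freq, key=lambda k: second_freq[k])[:z])
    return [name for name, kws in keywords_by_name.items() if rarest & set(kws)]


# Hypothetical name set for one physical object, already segmented.
name_set = {
    "Haidian Grand Hotel": ["Haidian", "Grand", "Hotel"],
    "Grand Hotel Beijing": ["Grand", "Hotel", "Beijing"],
    "Grand Hotl": ["Grand", "Hotl"],
}
corpus_freq = {"Hotel": 5000, "Grand": 120, "Hotl": 1}
address_words = {"Haidian", "Beijing"}

suspects = find_suspect_names(name_set, corpus_freq, address_words, x=1, z=1)
```

In this sample the rare token `Hotl` becomes the keyword of the third name and is also the rarest keyword in the set, so only that name is flagged. When every name in a set agrees, the z rarest keywords still flag something, so a practical system would likely add a minimum-agreement check that the claims do not describe.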
CN201510642636.0A 2015-09-30 2015-09-30 The determination method and device of the confidence level of interest point data in a kind of website Active CN105160032B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510642636.0A CN105160032B (en) 2015-09-30 2015-09-30 The determination method and device of the confidence level of interest point data in a kind of website

Publications (2)

Publication Number Publication Date
CN105160032A CN105160032A (en) 2015-12-16
CN105160032B true CN105160032B (en) 2019-05-31

Family

ID=54800888

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510642636.0A Active CN105160032B (en) 2015-09-30 2015-09-30 The determination method and device of the confidence level of interest point data in a kind of website

Country Status (1)

Country Link
CN (1) CN105160032B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107590165B (en) * 2016-07-08 2021-10-08 阿里巴巴(中国)有限公司 Confidence coefficient setting method, equipment and server
CN109241208B (en) * 2017-07-10 2022-05-27 阿里巴巴集团控股有限公司 Address positioning method, address monitoring method, information processing method and device

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102841920A (en) * 2012-06-30 2012-12-26 北京百度网讯科技有限公司 Method and device for extracting webpage frame information
CN104077295A (en) * 2013-03-27 2014-10-01 百度在线网络技术(北京)有限公司 Data label mining method and data label mining system
CN104572956A (en) * 2014-12-29 2015-04-29 北京奇虎科技有限公司 System and method for confirming POI information effectiveness
CN104699835A (en) * 2015-03-31 2015-06-10 北京奇虎科技有限公司 Method and device used for determining webpages including POI (point of interest) data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8725407B2 (en) * 2009-11-09 2014-05-13 United Parcel Service Of America, Inc. Enhanced location information for points of interest

Also Published As

Publication number Publication date
CN105160032A (en) 2015-12-16

Similar Documents

Publication Publication Date Title
CN111026937B (en) Method, device and equipment for extracting POI name and computer storage medium
CN102831121B (en) Method and system for extracting webpage information
CN104572955A (en) A system and method for determining POI names based on clustering
CN103514234B (en) A kind of page info extracting method and device
US8682882B2 (en) System and method for automatically identifying classified websites
CN104573028A (en) Intelligent question-answer implementing method and system
CN107832325B (en) POI data verification method and equipment
CN108287843A (en) A kind of method and apparatus and navigation equipment of interest point information retrieval
CN107203526B (en) Query string semantic demand analysis method and device
Nesi et al. Geographical localization of web domains and organization addresses recognition by employing natural language processing, Pattern Matching and clustering
CN109344355A (en) Automatic returning detection and Block- matching adaptive approach and device for Web evolution
WO2019227581A1 (en) Interest point recognition method, apparatus, terminal device, and storage medium
CN106599215A (en) Question generation method and question generation system based on deep learning
JP2018537760A (en) Method and apparatus for account mapping based on address information
Li et al. A method based on an adaptive radius cylinder model for detecting pole-like objects in mobile laser scanning data
Ahlers et al. Location-based Web search
Nesi et al. Ge (o) lo (cator): Geographic information extraction from unstructured text data and web documents
KR101638535B1 (en) Method of detecting issue patten associated with user search word, server performing the same and storage medium storing the same
CN105279249B (en) A method and device for determining the confidence of point of interest data in a website
Kilic et al. Effects of reverse geocoding on OpenStreetMap tag quality assessment
CN105159885A (en) Point-of-interest name identification method and device
CN105160032B (en) The determination method and device of the confidence level of interest point data in a kind of website
CN105447191B (en) Intelligent summarization method and corresponding device for providing graphical and text-guided steps
CN105138708A (en) Method and device for identifying names of points of interest (POI)
CN112685525B (en) Location recommendation method, device, equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220708

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co., Ltd

TR01 Transfer of patent right