CN105975477B - A Method of Automatically Constructing Place Name Dataset Based on Network - Google Patents
- Publication number
- CN105975477B (application number CN201610214120.0A)
- Authority
- CN
- China
- Prior art keywords
- address
- data
- information
- name
- extracted
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9537—Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/29—Geographical information databases
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Remote Sensing (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method for automatically constructing a place name dataset from the web, in the field of computer application technology. The method comprises the following steps: (1) use the Google search engine API to extract geospatial data from the Google database; (2) filter irrelevant web pages out of the extracted data; (3) take the output of step 2 as input and extract geographic information from it; (4) select a geocoding tool, convert the extracted address information into geographic coordinates, and mark them on a map. In the data extraction module, the invention exploits the strengths of a search engine and retrieves geographic information from web pages with suitable search query keywords. In the web page filtering module, a filtering algorithm excludes useless, interfering data. Geographic information is thus extracted effectively and dynamically from an unstructured data source (web pages), so that the resulting data is both highly complete and up to date. The method has high practical value.
Description
Technical Field
The invention belongs to the field of computer application technology and relates in particular to a method for automatically constructing a place name dataset from the web.
Background
Place names play an important role in many geographic applications, for example location-based services on mobile devices. Such applications call for a technique that can build a place name dataset automatically. The amount of useful geographic information on the web is already huge and grows daily, and obtaining accurate, timely geographic information from the web to build a place name dataset remains a hard problem.
Spatial data sources fall into two categories: structured and unstructured. Many researchers have retrieved data from structured sources, downloading geographic data files either through links (for example DBpedia) or interactively (for example LinkedGeoData, which provides a SPARQL interface, and Wikimapia and OpenStreetMap, which provide RESTful APIs). These well-structured sources provide static information but do not reflect the latest changes. Moreover, the data in sources such as OpenStreetMap and Wikimapia is contributed by individuals and has not been verified by an authoritative body. Google Maps data, by contrast, is manually verified and therefore more accurate, but for the same reason it is updated slowly, because verifying a new place takes time.
The unstructured nature of web pages, on the other hand, allows geographic information to change in real time, and the latest geographic information is often published on web pages first.
Most web-based retrieval of place names has to deal with ambiguity, because place names extracted from web pages are often ambiguous. A place name may also have a non-geographic meaning, since places are frequently named after objects, people, physical features, or historical factors. Conversely, many places are named after better-known places, so two different places may share the same name. Some researchers have applied supervised learning, using a co-occurrence model derived from Wikipedia, to resolve place name ambiguity.
In summary, research on building place name datasets currently lacks a method that extracts valid, dynamically changing geographic information from unstructured data sources while effectively resolving place name ambiguity.
Summary of the Invention
The object of the invention is to provide a method for automatically constructing a place name dataset from the web, characterized in that the method comprises the following steps:
Step 1: Use the Google search engine API to extract geospatial data from the Google database;
Step 2: Filter irrelevant web pages out of the extracted data;
Step 3: Take the output of step 2 as input and extract geographic information from it;
Step 4: Select a geocoding tool, convert the extracted address information into geographic coordinates, and mark them on a map.
Step 1 comprises the following sub-steps:
Step A1: Extract street names from OSM (OpenStreetMap). The OSM data is downloaded as an XML file built from three primitive data types: nodes, ways, and relations. Each primitive carries a set of tags, and each tag is essentially a (k, v) pair. OSM is a leading VGI (Volunteered Geographic Information) project that aims to create a freely editable geographic platform open to volunteers worldwide; it currently has more than 1.6 million registered users, nearly 30% of whom have actually contributed to the project;
Step A2: Determine the search keywords. A search engine query consists of three parts: street name, city name, and business type. The street names come from step A1. For business types, popular types are first supplied manually, and missing types are then added based on the results later displayed on the map;
Step A3: Select a search engine and extract geospatial data from it. What can be extracted depends on how the search engine works. Its operation comprises collecting information, organizing information, accepting queries, and visualization. By search method, search engines are further divided into full-text search, directory indexes, and meta-search.
Step 2 selects a filtering algorithm suited to the specific goal and applies it to the returned results, removing unwanted data and keeping the wanted data. The returned data includes a large number of real estate listings, which consist mainly of residential addresses. To remove them, a search engine is used to search for a recently sold home listed on any real estate company's website; the search results then contain the URLs of the real estate company websites. These URLs are extracted and then automatically filtered out of the results of step 1, which avoids wasting time and resources on parsing useless geographic information from real estate websites.
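As an illustration of step A1, the following is a minimal Python sketch that collects street names from an OSM XML document. It assumes that streets are ways carrying both a `highway` tag and a `name` tag, which is the usual OSM tagging convention; the function name `extract_street_names` and the sample data are hypothetical and do not come from the patent.

```python
import xml.etree.ElementTree as ET


def extract_street_names(osm_xml: str) -> list[str]:
    """Return the distinct names of ways tagged as roads, sorted alphabetically."""
    root = ET.fromstring(osm_xml)
    names = set()
    for way in root.iter("way"):
        # Each tag is a (k, v) pair stored as XML attributes.
        tags = {t.get("k"): t.get("v") for t in way.findall("tag")}
        # Assumption: a way with a "highway" key is a road of some kind.
        if "highway" in tags and "name" in tags:
            names.add(tags["name"])
    return sorted(names)


# Made-up sample in the OSM XML layout (one node, two ways).
SAMPLE_OSM = """<osm version="0.6">
  <node id="1" lat="0.0" lon="0.0"/>
  <way id="10">
    <nd ref="1"/>
    <tag k="highway" v="residential"/>
    <tag k="name" v="Example Street"/>
  </way>
  <way id="11">
    <tag k="building" v="yes"/>
    <tag k="name" v="Example Hall"/>
  </way>
</osm>"""

street_names = extract_street_names(SAMPLE_OSM)
```

Only the first way is kept, because the second is a building rather than a road. For large extracts, a streaming parser such as `ET.iterparse` would likely be a better fit than loading the whole file.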
Step 3 comprises the following sub-steps:
Step C1: During address extraction, two cases arise. In the first, the entire address is on a single line; in the second, the address is spread over several lines;
Step C2: In the first case of step C1, check whether a line of the web page begins with a digit and contains the city name, and whether its length is below a given threshold. If the line is longer than the threshold, it is very unlikely to contain address information;
Step C3: In the second case of step C1, apply the same test as in step C2 to decide whether two consecutive lines joined together represent an address: if the first line begins with a digit, the second line contains the city name, and the length of the two lines is below the given threshold, the two lines are extracted together as one address;
Step C4: Determine whether more than one address was extracted. If the web page contains exactly one address, the page title is very likely the name of the place. If it contains more than one address, all addresses on the page are extracted and returned as an address list;
Step C5: Given the address list returned by step C4, run a search for each address in the list. Among the returned pages, if a page contains exactly one address and that address is identical to the queried address, the title of that page is taken as the place name;
Step C6: Finally, obtain the corresponding place name for each address in the address list.
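Steps C1 to C3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `extract_addresses` is hypothetical, the default threshold of 100 follows the value given in the embodiment, and for the two-line case the threshold is assumed to apply to the combined length of both lines.

```python
def extract_addresses(lines, city, max_len=100):
    """Return address strings found in the text lines of one web page."""
    city = city.lower()
    lines = [ln.strip() for ln in lines]
    found, i = [], 0
    while i < len(lines):
        line = lines[i]
        if line[:1].isdigit():
            # Case 1 (step C2): the whole address sits on one line.
            if city in line.lower() and len(line) < max_len:
                found.append(line)
                i += 1
                continue
            # Case 2 (step C3): street on this line, city on the next one.
            if i + 1 < len(lines):
                nxt = lines[i + 1]
                if city in nxt.lower() and len(line) + len(nxt) < max_len:
                    found.append(f"{line}, {nxt}")
                    i += 2
                    continue
        i += 1
    return found


# Made-up page text for illustration.
page_lines = [
    "Welcome to our shop",
    "2117 Example Ln, Springfield, TX 75000",
    "100 Main St",
    "Springfield, TX",
    "12 items in stock",
]
addresses = extract_addresses(page_lines, "Springfield")
```

The last line starts with a digit but never meets the city-name condition, so it is skipped. The heuristic is deliberately simple and will miss addresses that do not start with a house number.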
Step 4 comprises the following sub-steps:
Step D1: Upload the dataset to a geocoding tool so that the data appears in the tool;
Step D2: The geocoding tool automatically detects the location data and presents it as a tab;
Step D3: Click the tab to display the corresponding information;
Step D4: Based on the location data detected and presented in step D2, choose which data is displayed and which is not, and in what form it is displayed.
Beneficial effects of the invention: the proposed method for automatically constructing a place name dataset from the web exploits the strengths of a search engine in the data extraction module and retrieves geographic information from web pages with suitable search query keywords. In the web page filtering module, a filtering algorithm excludes useless, interfering data. Geographic information is extracted effectively and dynamically from an unstructured data source (web pages), so that the data is both highly complete and up to date. This overcomes the drawbacks of most geographic datasets, which are derived from structured data sources and are therefore incomplete and slow to reflect changes. The method has high practical value.
Brief Description of the Drawings
Figure 1 is a schematic diagram of the working principle of the search engine.
Figure 2 is a schematic diagram of retrieving geographic information from web pages.
Figure 3 is a schematic diagram of geographic information from the Google search engine marked on a map.
Figure 4 is a schematic diagram of the extracted address information converted into geographic coordinates and marked on a map.
Detailed Description
The invention proposes a method for automatically constructing a place name dataset from the web, which is described below with reference to the drawings and an embodiment.
Figure 1 shows the working principle of the search engine. There are four modules: data acquisition (collecting information), web page filtering (organizing information), information extraction (accepting queries), and visualization. Data acquisition retrieves relevant web pages according to keywords. Web page filtering removes irrelevant pages, such as real estate home pages, from the retrieved pages. Information extraction pulls geographic information such as addresses and place names out of the retrieved pages. Visualization displays the acquired geographic information on a map for easy comparison and lookup.
The method addresses how to choose query terms and, for a given set of query terms, how to discard useless results and keep the useful URLs. Once the URLs are extracted, the pages are parsed differently depending on whether a complete address sits on one line or spans several lines. Whether one address or several are parsed from a page, useful place names and the corresponding geographic information can be extracted in the end and visualized through geocoding. In the embodiment below, 10 regions are selected for testing, and precision, recall, and the F value (the harmonic mean of precision and recall) are computed. With the geographic information in the Google panoramic map taken as ground truth, the method performs clearly better than OSM, is on par with Google Maps overall, and outperforms Google Maps in some regions, which indicates that the method has high practical value.
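The evaluation metrics named above can be computed as in the following Python sketch. It assumes that extracted and ground-truth places are compared as exact set members; how the patent matches a place to the ground truth is not stated, and the function name `precision_recall_f` and the sample sets are hypothetical.

```python
def precision_recall_f(extracted: set, ground_truth: set):
    """Return (precision, recall, F) for a set of extracted place names."""
    true_positives = len(extracted & ground_truth)
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    # F is the harmonic mean of precision and recall.
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Made-up example: 4 places extracted, 3 in the ground truth, 2 in common.
p, r, f = precision_recall_f({"a", "b", "c", "d"}, {"a", "b", "e"})
```

Here precision is 2/4, recall is 2/3, and the F value is 4/7.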
Embodiment
In this method, automatically constructing a place name dataset from the web comprises the following steps:
Step 1: Use the Google search engine API to extract geospatial data from the Google database;
Step 2: Filter irrelevant web pages out of the extracted data;
Step 3: Take the output of step 2 as input and extract geographic information from it;
Step 4: Select a geocoding tool, convert the extracted address information into geographic coordinates, and mark them on a map. The specific operations are as follows:
Step 1: This step comprises four operations, A1 to A4. The results of steps A1 and A2 are first summarized in the following table:
Step A3: Select a search engine and extract geospatial data from it. Because the extraction depends heavily on how the search engine works, the Google search engine is selected here;
Step A4: Search Google with two query formats, <City Name><Place Types> and <Street Names><City Name><Place Types>. The latter returns more complete results and is therefore adopted (see Figure 3).
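A minimal Python sketch of how queries in the adopted <Street Names><City Name><Place Types> format might be generated. The function name `build_queries`, the space-separated layout, and the sample values are assumptions for illustration; submitting the queries to a search API is not shown.

```python
from itertools import product


def build_queries(street_names, city_name, place_types):
    """Combine every street name with every place type for one city."""
    return [
        f"{street} {city_name} {place_type}"
        for street, place_type in product(street_names, place_types)
    ]


# Made-up inputs: street names would come from step A1, place types from step A2.
queries = build_queries(["Main St", "Oak Ave"], "Springfield", ["restaurant", "bank"])
```

The number of queries grows as the product of street names and place types, so a real run would probably need rate limiting against the search API.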
Step 2: Filter irrelevant web pages out of the extracted data;
Here, a search engine is used to search for a recently sold home listed on any real estate company's website. For example, searching Google for "2117 Tondolea LN" returns a first page of results that contains real estate company websites such as Zillow, Redfin, Movoto, Trulia, Realtor, and RE/MAX. The URLs of these websites, such as www.zillow.com and www.redfin.com, are then extracted and automatically filtered out of the results of step 1.
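The filtering step can be sketched in Python as a domain blocklist. This is an illustrative reading, not the patent's algorithm: the function names are hypothetical, URLs are assumed to include a scheme such as `https://`, and subdomains of a blocked domain are assumed to be blocked as well.

```python
from urllib.parse import urlparse


def domain_of(url: str) -> str:
    """Return the lower-cased host of a URL without a leading 'www.'."""
    host = (urlparse(url).hostname or "").lower()
    return host[4:] if host.startswith("www.") else host


def build_blocklist(probe_result_urls):
    """Collect the domains returned when searching for a known residential address."""
    return {domain_of(url) for url in probe_result_urls}


def filter_results(result_urls, blocklist):
    """Drop every result whose host is a blocked domain or one of its subdomains."""
    kept = []
    for url in result_urls:
        host = domain_of(url)
        if any(host == d or host.endswith("." + d) for d in blocklist):
            continue
        kept.append(url)
    return kept


# Made-up URLs on the two domains named in the text, plus one unrelated site.
blocklist = build_blocklist([
    "https://www.zillow.com/homedetails/example",
    "https://www.redfin.com/example",
])
kept = filter_results(
    ["https://www.zillow.com/a", "https://example.org/cafe", "https://m.redfin.com/b"],
    blocklist,
)
```

Only the unrelated site survives. In practice the probe results may also contain sites that are not real estate listings, so the blocklist would likely need a manual review.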
Step 3: Take the output of step 2 as input and extract geographic information from it (see Figure 2);
Step 3 comprises six sub-steps, C1 to C6, described as follows:
Step C1: During address extraction, two cases arise. In the first, the entire address is on a single line; in the second, the address is spread over several lines;
Step C2: In the first case of step C1, check whether a line of the web page begins with a digit and contains the city name, with the length threshold set to 100;
Step C3: In the second case of step C1, apply the same test as in step C2 to decide whether two consecutive lines joined together represent an address;
Step C4: Determine whether more than one address was extracted. If the web page contains exactly one address, the page title is assumed to be the place name. If it contains more than one address, an address list is returned;
Step C5: Given the address list returned by step C4, run a search for each address in the list. Among the returned pages, if a page contains exactly one address and that address is identical to the queried address, the title of that page is taken as the place name;
The pseudocode for steps C4 and C5 is as follows:
Step C6: Finally, obtain the corresponding place name for each address in the address list.
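The sketch below is one possible Python reading of steps C4 to C6; it is not the patent's own pseudocode. The `Page` type, the `search` callback (standing in for a search engine query followed by address extraction on each returned page), and the `normalize` comparison are all assumptions introduced for illustration.

```python
from typing import Callable, Iterable, NamedTuple, Optional


class Page(NamedTuple):
    title: str
    addresses: list  # addresses already extracted from this page


def normalize(address: str) -> str:
    """Loose comparison key: lower case, commas dropped, whitespace collapsed."""
    return " ".join(address.lower().replace(",", " ").split())


def name_for_address(address: str,
                     search: Callable[[str], Iterable[Page]]) -> Optional[str]:
    """Step C5: find a page whose only address equals the queried address."""
    for page in search(address):
        if len(page.addresses) == 1 and normalize(page.addresses[0]) == normalize(address):
            return page.title
    return None


def place_names(page: Page, search: Callable[[str], Iterable[Page]]) -> dict:
    """Steps C4 and C6: map every address on a page to a place name, or None."""
    if len(page.addresses) == 1:
        # A single address: the page title is assumed to be the place name.
        return {page.addresses[0]: page.title}
    return {addr: name_for_address(addr, search) for addr in page.addresses}
```

An address for which no single-address page is found maps to `None` here; the patent does not say how such cases are handled.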
Step 4: Select a geocoding tool, convert the extracted address information into geographic coordinates, and mark them on a map (see Figure 4). The specific operations are as follows:
Step D1: Upload the dataset to Google Fusion Tables (a geocoding tool); the uploaded data then appears in the tool;
Step D2: Google Fusion Tables automatically detects the location data and presents it in a tab called Map of <location column name>;
Step D3: Click the tab to display the corresponding information;
Step D4: Choose which data is displayed and which is not, and in what form. In this work, for example, two kinds of pins are defined: blue pins labeled Y show the correct place names obtained in the information extraction step, while red pins labeled N show the information that was filtered out, such as houses for rent or for sale.
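Before upload, the dataset has to be serialized in some tabular form. The following Python sketch writes the extracted records to CSV with the Y/N label described in step D4. The column names and the choice of CSV are assumptions, and the function name `to_csv` is hypothetical.

```python
import csv
import io


def to_csv(records) -> str:
    """Serialize (name, address, accepted) records with a Y/N label column."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["name", "address", "label"])
    for name, address, accepted in records:
        # Y marks a place name kept by extraction, N marks a filtered-out entry.
        writer.writerow([name, address, "Y" if accepted else "N"])
    return buffer.getvalue()


# Made-up records for illustration.
csv_text = to_csv([
    ("Joe's Cafe", "1 Main St, Springfield", True),
    ("House for sale", "2 Oak Ave, Springfield", False),
])
```

The `csv` module quotes the address fields automatically because they contain commas.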
In summary, the invention exploits the strengths of a search engine in the data extraction module, retrieving geographic information from web pages with suitable search query keywords, and uses a filtering algorithm in the web page filtering module to exclude useless, interfering data. The location extraction module uses existing algorithms to extract useful information such as place names and addresses. The visualization module displays the extracted place names on a map in order to evaluate the quality of the generated place name dataset.
Claims (1)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610214120.0A CN105975477B (en) | 2016-04-07 | 2016-04-07 | A Method of Automatically Constructing Place Name Dataset Based on Network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610214120.0A CN105975477B (en) | 2016-04-07 | 2016-04-07 | A Method of Automatically Constructing Place Name Dataset Based on Network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105975477A CN105975477A (en) | 2016-09-28 |
CN105975477B true CN105975477B (en) | 2019-11-08 |
Family
ID=56989512
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610214120.0A Expired - Fee Related CN105975477B (en) | 2016-04-07 | 2016-04-07 | A Method of Automatically Constructing Place Name Dataset Based on Network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105975477B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108153865A (en) * | 2017-12-22 | 2018-06-12 | 中山市小榄企业服务有限公司 | Internet network application acquisition system |
CN109974726A (en) * | 2017-12-28 | 2019-07-05 | 北京搜狗科技发展有限公司 | A kind of road state determines method and device |
CN108334579B (en) * | 2018-01-25 | 2021-10-22 | 孙如江 | Space-time service-based place name identification number coding device |
CN108984640A (en) * | 2018-06-22 | 2018-12-11 | 华北电力大学 | A Geographical Information Acquisition Method Based on Web Data Mining |
CN112084389A (en) * | 2020-08-17 | 2020-12-15 | 上海交通大学 | Network crawler-based academic institution geographical position information extraction method |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105335468A (en) * | 2015-09-28 | 2016-02-17 | 北京信息科技大学 | Geographic position entity normalized method based on Baidu map API |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8055671B2 (en) * | 2007-08-29 | 2011-11-08 | Enpulz, Llc | Search engine using world map with whois database search restriction |
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105335468A (en) * | 2015-09-28 | 2016-02-17 | 北京信息科技大学 | Geographic position entity normalized method based on Baidu map API |
Non-Patent Citations (1)
Title |
---|
垂直搜索引擎中的多元化信息融合检索研究;宁登鹏;《中国优秀硕士学位论文数据库 信息科技辑》;20080715(第7期);第4章第4.1.1-4.2.2节、第5章第5.1.3节 * |
Also Published As
Publication number | Publication date |
---|---|
CN105975477A (en) | 2016-09-28 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105975477B (en) | A Method of Automatically Constructing Place Name Dataset Based on Network | |
CN103092950B (en) | A kind of network public-opinion geographic position real-time monitoring system and method | |
CN101350154B (en) | Method and device for sorting electronic map data | |
US6611751B2 (en) | Method and apparatus for providing location based data services | |
CN100481077C (en) | Visual method and device for strengthening search result guide | |
CN101882163A (en) | A Geographic Assignment Method of Fuzzy Chinese Addresses Based on Matching Rules | |
US20090132469A1 (en) | Geocoding based on neighborhoods and other uniquely defined informal spaces or geographical regions | |
CN104462227A (en) | Automatic construction method of graphic knowledge genealogy | |
CN107944898A (en) | The automatic discovery of advertisement putting building information and sort method | |
CN101350013A (en) | Method and system for searching geographical information | |
CN100478960C (en) | Method for locating unknown place name in network map service | |
Chuang et al. | Enabling maps/location searches on mobile devices: Constructing a POI database via focused crawling and information extraction | |
CN107908627A (en) | A kind of multilingual map POI search systems | |
WO2015018247A1 (en) | Event multi-dimensional information display device and method | |
CN102880721A (en) | Implementation method of vertical search engine | |
US8700624B1 (en) | Collaborative search apps platform for web search | |
CN100470549C (en) | Form locating data mining method | |
CN108984640A (en) | A Geographical Information Acquisition Method Based on Web Data Mining | |
CN113626536B (en) | News geocoding method based on deep learning | |
CN104657486B (en) | A kind of method that confidence level based on polyfactorial administrative division calculates | |
TWM523901U (en) | Search engine device for performing semantic keyword analysis | |
Manguinhas et al. | A geo-temporal web gazetteer integrating data from multiple sources | |
CN108846018A (en) | A kind of Chinese food safety media event Information Automatic Extraction method towards news | |
TW201040752A (en) | Method and system for providing localized information | |
TWM523902U (en) | Search engine device for collecting keyword |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | ||
Granted publication date: 20191108 Termination date: 20200407 |