
CN103116580A - Providing method, system and device of website content information - Google Patents


Info

Publication number
CN103116580A
Authority
CN
China
Prior art keywords
website
information
link
resource
view
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2011103626460A
Other languages
Chinese (zh)
Inventor
王寓辰
倪伟
毕娅娜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN2011103626460A priority Critical patent/CN103116580A/en
Publication of CN103116580A publication Critical patent/CN103116580A/en
Pending legal-status Critical Current

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method, system and device for providing website content information. The method comprises: performing a crawling search based on the obtained initial link information of an imported website, obtaining the link information contained in the imported website, and obtaining the link object of each piece of link information together with its attribute information; establishing, from the obtained link objects and their attribute information, a link object index corresponding to the link information; establishing a website resource view of each imported website according to the associations between the link object indexes, the website resource view containing the link object indexes corresponding to the link information of each imported website, arranged according to set rules; and providing website content information to a website information requester according to the established website resource view. The content information of imported websites can thus be obtained accurately and promptly, which supports accurate content scheduling by the information requester.

Figure 201110362646

Description

Method, System and Device for Providing Website Content Information

Technical Field

The present invention relates to the field of data services, and in particular to a method, system and device for providing website content information.

Background

An Internet Data Center (IDC) can import website content and serve it to users, either directly or indirectly through a Content Delivery Network (CDN).

At present, an IDC usually imports website content sources partially rather than in full: only some channels or some content of a website are imported. The IDC typically obtains, updates and manages the content information of hosted websites in one of two ways. In the first, the website resource provider declares the imported resources manually, and the IDC learns the imported website resource information from those declarations. In the second, an IDC administrator obtains and configures the imported website resource information manually.

Of these two approaches, the first depends heavily on action by the content provider and cannot guarantee the accuracy, timeliness or granularity required of the website content index. The second consumes a large amount of IDC administrative labor, is inefficient, and cannot guarantee that the index of imported content is updated promptly. In other words, control over content imported into an IDC is currently achieved only at the device level. Content-level control remains coarse, and it is difficult to obtain the content index information of IDC-imported websites accurately, promptly and at low cost, which hinders precise management of IDC content.

When the existing IDC content import and control mechanism is used in a CDN, the CDN contains, in addition to the website content service based on IDC imports, a cache control (WebCache) system that caches popular website content. The CDN resource scheduling center coordinates the scheduling of user access to both IDC-imported content and WebCache-cached content, so that users are directed to the two sources sensibly.

Because an IDC imports website content directly from the content provider, its content should generally be more up to date than the copy cached by the WebCache system, so it is generally preferable to schedule users to IDC-imported content first. However, since the CDN also contains a WebCache system, a user request may well be scheduled to the WebCache system even though the requested content has already been imported into the IDC, which wastes both IDC system resources and WebCache cache resources. Avoiding such conflicts requires detailed knowledge of the website content imported by the IDC.

Under the existing mechanism, website content information for IDC imports is updated mainly by hand, by the provider or the IDC administrator. Updates are slow, the procedure is complicated, accuracy is low, and the information is rarely current. After an IDC website content source is updated, the CDN bus system cannot learn of the change in time. User requests may therefore run into an access conflict between IDC-imported content and WebCache-cached content, because the CDN bus cannot tell whether it should schedule the user to the IDC-imported website resource or to the WebCache-cached one.

Summary of the Invention

Embodiments of the present invention provide a method, system and device for providing website content information, to solve the prior-art problem that the website content imported by an IDC cannot be known accurately, so that user content access requests cannot be scheduled precisely and system resources are wasted.

A method for providing website content information comprises:

performing a crawling search based on the obtained initial link information of an imported website, obtaining the link information contained in the imported website, and obtaining the link object of each piece of link information together with its attribute information;

establishing, from the obtained link objects and their attribute information, a link object index corresponding to the link information;

establishing a website resource view of each imported website according to the associations between the link object indexes of the link information, the website resource view containing the link object indexes corresponding to the link information of each imported website, arranged according to set rules; and

providing website content information to a website information requester according to the established website resource view.

A device for providing website content information comprises:

a search module, configured to perform a crawling search based on the obtained initial link information of an imported website, obtain the link information contained in the imported website, and obtain the link object of each piece of link information together with its attribute information;

an index module, configured to establish, from the obtained link objects and their attribute information, a link object index corresponding to the link information;

a view resource generation module, configured to establish a website resource view of each imported website according to the associations between the link object indexes of the link information, the website resource view containing the link object indexes corresponding to the link information of each imported website, arranged according to set rules; and

an access retrieval module, configured to provide website content information to a website information requester according to the established website resource view.

A system for providing website content information comprises the above device for providing website content information and at least one website information requesting device.

The beneficial effects of the present invention are as follows:

The method, system and device provided by the embodiments of the present invention build an associated index over the initial link information of a website, all link information associated with it level by level, and the corresponding link objects, and establish a website resource view. The website content information imported by the IDC can thus be known accurately and provided to users, while query time and system resource usage are reduced. Even when IDC-imported website resources and WebCache-cached website resources coexist, the IDC-imported resources can be scheduled to users first, which avoids conflicts in content access scheduling and saves system resources.

Brief Description of the Drawings

The drawings described here provide a further understanding of the present invention and form part of it. The illustrative embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:

Figure 1 is a flowchart of the method for providing website content information in an embodiment of the present invention;

Figure 2 is a schematic structural diagram of the system for providing website content information in an embodiment of the present invention;

Figure 3 is a schematic structural diagram of the device for providing website content information in an embodiment of the present invention;

Figure 4 is a detailed structural diagram of the system for providing website content information in an embodiment of the present invention;

Figure 5 is a flowchart of the generation of a resource view by the device for providing website content information in an embodiment of the present invention;

Figure 6 is a flowchart of the method for providing website content information in Embodiment 1 of the present invention;

Figure 7 is a flowchart of the method for providing website content information in Embodiment 2 of the present invention.

Detailed Description

To make the technical problems to be solved, the technical solutions and the beneficial effects of the present invention clearer, the invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here only explain the invention and do not limit it.

An embodiment of the present invention provides a method for providing website content information that obtains an up-to-date website content index by establishing a website resource view. The link objects corresponding to the initial link information of a website and to all link information associated with it are indexed together, a website resource view is established, and website content information is provided according to that view. The method flow is shown in Figure 1 and comprises the following steps:

Step S11: Perform a crawling search based on the obtained initial link information of the imported website, obtain the link information contained in the imported website, and obtain the link object of each piece of link information together with its attribute information.

The crawling search is performed according to both the obtained initial link information of the imported website and a preconfigured search strategy. The search strategy comprises one or a combination of the following: a depth-first strategy, a breadth-first strategy and a focused search strategy.

Starting from the initial link information of the imported website, all associated link information can be reached by crawling. That is, the initial link object and its attributes are obtained from the initial link information, the crawl continues from the current link object to discover new link information, and the object and attributes corresponding to each new piece of link information are obtained in turn. The initial link information may be the top-level domain name of the initial web page, for example a Uniform Resource Locator (URL), and the associated link information may be each URL found by crawling the web pages. For every piece of link information found, both initial and associated, the crawler retrieves the corresponding link object.

A link object comprises the web page and/or file that the link information points to. The attribute information of a link object comprises one or a combination of the following: link value, link type, page title, number of times crawled, crawl time, crawl depth, whether this is the first crawl, default encoding, page snapshot, file object name and object type.
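As an illustration of the attribute list above, the record kept for each link object could be modeled as follows. This is only a sketch: the class name `LinkObject`, the field names, the field types and the `mark_crawled` helper are assumptions made for the example and are not specified in the patent.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class LinkObject:
    """One crawled web page or file and its attributes (hypothetical layout)."""
    link_value: str                        # the URL itself
    link_type: str                         # e.g. "page" or "file" (labels assumed)
    page_title: Optional[str] = None
    crawl_count: int = 0                   # number of times crawled
    crawl_time: Optional[datetime] = None  # time of the most recent crawl
    crawl_depth: int = 0                   # distance from the initial link
    first_crawl: bool = True               # whether the latest crawl was the first
    default_encoding: Optional[str] = None
    page_snapshot: Optional[str] = None
    file_object_name: Optional[str] = None
    object_type: Optional[str] = None

    def mark_crawled(self, when: datetime) -> None:
        """Record one more fetch of this object."""
        self.first_crawl = self.crawl_count == 0
        self.crawl_count += 1
        self.crawl_time = when
```

Any subset of these fields may be populated, matching the "one or a combination" wording of the description.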

For example, the crawling search can be carried out by a device for providing website content information. Starting from one or several top-level domain links of imported websites supplied by the IDC service providing equipment, the crawler collects the URLs on the initial web page. For each URL, the crawler saves the attribute information of the link object (web page, file, etc.) that the link points to, including but not limited to link value, link type, page title, number of times crawled, crawl time, crawl depth, whether this is the first crawl, default encoding, page snapshot, file object name and object type. Meanwhile, the crawler keeps extracting new URLs from the current page and placing them in a queue. Once the current page has been analyzed, it takes new URLs from the queue and continues to crawl page or object information until a preset stop condition is met.
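The queue-driven crawl described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function `crawl`, its parameters, the `max_pages` stop condition and the in-memory `SITE` map standing in for real HTTP fetching are all assumptions. Only the breadth-first and depth-first strategies are shown; the focused search strategy is omitted.

```python
from collections import deque
from typing import Callable, Iterable, List, Optional, Tuple

# Each visit is recorded as (url, parent_url, depth).
Visit = Tuple[str, Optional[str], int]


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], Iterable[str]],
          strategy: str = "breadth",
          max_pages: int = 100) -> List[Visit]:
    """Crawl from the seed URLs until the frontier is empty or max_pages is hit."""
    frontier: deque = deque()
    seen = set()
    for url in seeds:
        if url not in seen:
            seen.add(url)
            frontier.append((url, None, 0))

    visited: List[Visit] = []
    while frontier and len(visited) < max_pages:
        # Breadth-first takes the oldest entry, depth-first the newest.
        if strategy == "breadth":
            url, parent, depth = frontier.popleft()
        else:
            url, parent, depth = frontier.pop()
        visited.append((url, parent, depth))
        # Extract new URLs from the current page and queue the unseen ones.
        for child in fetch_links(url):
            if child not in seen:
                seen.add(child)
                frontier.append((child, url, depth + 1))
    return visited


# Hypothetical site map used in place of real page fetching.
SITE = {
    "http://a.example/": ["http://a.example/news", "http://a.example/video"],
    "http://a.example/news": ["http://a.example/news/1", "http://a.example/"],
    "http://a.example/video": ["http://a.example/video/clip.mp4"],
}

if __name__ == "__main__":
    for visit in crawl(["http://a.example/"], lambda u: SITE.get(u, [])):
        print(visit)
```

Recording the parent URL and depth at crawl time keeps the information that the later indexing step needs to relate link objects to one another.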

Step S12: From the link objects of the obtained link information and their attribute information, establish a link object index corresponding to the obtained link information.

A content index of each link object is built from the link objects contained in the obtained link information and their attribute information, and the associations between pieces of link information are determined from their path information. After analysis and filtering, a link object index is established that comprises both the associations between the pieces of link information and the content index of the link objects each contains.

The link information found by crawling, together with its link objects and their attribute information, is processed in two ways: building the content index of each link object, and associating the data of the link objects. The content index of each link object is compiled from the information collected by the crawler, such as link value, link type, page title, number of times crawled, crawl time, crawl depth, whether this is the first crawl, default encoding, page snapshot, file object name and object type. The URL paths followed by the crawler are recorded and used to determine the parent-child relationships between URLs, which form the associations between content indexes. The result is the link object index of each piece of link information, which provides the data needed to generate a global view of website resources.
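The two parts of the link object index described above, a per-object content index and the parent-child associations, might be assembled as in the following sketch. The record layout (dictionaries with `url` and `parent` keys) and the function name `build_link_object_index` are assumptions for illustration; the parent of a URL is taken here to be the page on which the crawler first found it.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def build_link_object_index(records: Iterable[Dict[str, Any]]) -> Dict[str, Any]:
    """Build a content index plus parent-child associations.

    Each record is a dict with a 'url' key, a 'parent' key (None for an
    initial link) and any number of attribute keys.
    """
    content: Dict[str, Dict[str, Any]] = {}
    children: Dict[str, List[str]] = defaultdict(list)
    for rec in records:
        url = rec["url"]
        # Content index: everything known about the link object itself.
        content[url] = {k: v for k, v in rec.items()
                        if k not in ("url", "parent")}
        # Association: remember which URL led the crawler to this one.
        if rec["parent"] is not None:
            children[rec["parent"]].append(url)
    return {"content": content, "children": dict(children)}
```

For instance, a record for `http://a.example/news` with parent `http://a.example/` ends up both in `content` (with its title, depth and so on) and in the child list of the root URL.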

Preferably, before the link object index corresponding to the link information is established, the link information found by crawling, together with its link objects and their attribute information, is deduplicated. An MD5 (Message-Digest Algorithm 5) digest can be computed over the crawled link information and its link objects and attribute information, and the resulting MD5 value is used to decide whether the item is identical to link information that already has a link object index. If it is, no further link object index is created. Other methods can of course also be used to decide whether crawled link information duplicates link information that has already been indexed. For example, once a URL has been crawled successfully, it need not be crawled again within the update period, yet other web pages may contain the same URL, so URLs must be deduplicated. The system computes an MD5 digest of every URL already crawled and compares MD5 values to keep crawled URLs unique: a URL whose MD5 value matches one already seen is not crawled again.
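The MD5-based URL deduplication can be sketched with Python's standard `hashlib` module. The class name `SeenUrls` and its method are assumptions for illustration. MD5 serves here purely as a compact fingerprint for equality checks, not as a security measure.

```python
import hashlib


def url_fingerprint(url: str) -> str:
    """Return the hexadecimal MD5 digest of a URL."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


class SeenUrls:
    """Tracks which URLs have already been crawled, keyed by MD5 digest."""

    def __init__(self) -> None:
        self._digests = set()

    def add_if_new(self, url: str) -> bool:
        """Return True if the URL was not seen before, and remember it."""
        digest = url_fingerprint(url)
        if digest in self._digests:
            return False  # same MD5 value: do not crawl again
        self._digests.add(digest)
        return True
```

Storing fixed-length digests rather than full URLs keeps the memory used per entry constant, regardless of URL length. Note that the comparison is on the exact string: two URLs that differ only in, say, a trailing slash produce different digests unless they are normalized first.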

Through the indexing, association, cleaning and deduplication of the data related to the crawled link information described above, standard IDC content index data is generated and the link object index of each piece of link information is obtained.

Step S13: Establish a website resource view of each imported website according to the associations between the link object indexes of the link information. The website resource view contains the link object indexes corresponding to the link information of each imported website, arranged according to set rules.

The website resource view of each imported website can be established from the associations between the link object indexes built above, for example the parent-child relationships between pieces of link information.
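One way to turn parent-child associations into a resource view is to build a nested tree rooted at the initial link, as sketched below. The patent does not define the "set rules" for arranging entries; sorting children alphabetically by URL is an assumption made here, as are the function name and the node layout.

```python
from typing import Any, Dict, List


def build_resource_view(root: str,
                        children: Dict[str, List[str]],
                        content: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Build a nested resource view for one imported website.

    `children` maps a URL to the URLs first discovered on it, so every URL
    has at most one parent and the structure is a tree (no cycles).
    """
    node: Dict[str, Any] = {
        "url": root,
        "attrs": content.get(root, {}),
        "children": [],
    }
    # Assumed arrangement rule: children in alphabetical order of URL.
    for child in sorted(children.get(root, [])):
        node["children"].append(build_resource_view(child, children, content))
    return node
```

Calling the function once per initial link yields one view per imported website.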

Step S14: Provide website content information to the website information requester according to the established website resource view.

Website content information is generally provided to the website information requester through a website information requesting device. Either the resource view is delivered to the website information requesting device, or a query interface is opened through which the device queries the resource view. In both cases the website information requesting device provides the website content information to the requester according to the website resource view. The IDC website information requesting device may be a CDN resource bus, an IDC service platform, or another IDC website information requesting device.

Optionally, the website content information may also be provided to the website information requester directly from the established website resource view, without going through a website information requesting device.

Delivering the resource view to the website information requesting device specifically comprises: in response to a view resource acquisition request sent by the device, providing the established website resource view to the device, either as is or after adjusting it according to the configuration requirements in the request. The device then provides the requested website content information to the website information requester according to the obtained view, which includes providing a website access scheduling service or an IDC site management service based on the view.

Opening a query interface for the website information requesting device specifically comprises: in response to a view resource query request sent by the device, opening a query interface to the device and, through that interface, providing either the established website resource view or a view adjusted according to the configuration requirements in the query request. The device then provides the requested website content information to the website information requester according to the queried view, which includes providing a website access scheduling service or an IDC site management service based on the view.
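The two delivery modes described above, handing over the whole view (optionally adjusted to the requester's configuration) and answering queries against it, could look like the sketch below. The class `ResourceViewService`, its methods and the `max_depth` option are hypothetical: the patent does not say what the configuration requirements contain, so depth limiting is used purely as an example of an adjustment.

```python
import copy
from typing import Any, Dict, Optional


def prune(node: Dict[str, Any], max_depth: int, depth: int = 0) -> Dict[str, Any]:
    """Return a copy of the view cut off below max_depth levels."""
    pruned = {k: copy.deepcopy(v) for k, v in node.items() if k != "children"}
    pruned["children"] = []
    if depth < max_depth:
        pruned["children"] = [prune(c, max_depth, depth + 1)
                              for c in node["children"]]
    return pruned


class ResourceViewService:
    """Serves one website resource view to requesting devices."""

    def __init__(self, view: Dict[str, Any]) -> None:
        self._view = view

    def get_view(self, max_depth: Optional[int] = None) -> Dict[str, Any]:
        """Acquisition mode: hand over the view, adjusted if requested."""
        if max_depth is None:
            return copy.deepcopy(self._view)
        return prune(self._view, max_depth)

    def contains(self, url: str) -> bool:
        """Query mode: report whether a URL appears in the view."""
        stack = [self._view]
        while stack:
            node = stack.pop()
            if node["url"] == url:
                return True
            stack.extend(node["children"])
        return False
```

Returning copies keeps a requesting device from modifying the view held by the providing device.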

Based on the method described above, an embodiment of the present invention also provides a system for providing website content information. Its structure is shown in Figure 2 and comprises the above device for providing website content information and at least one website information requesting device, which can be connected through interfaces such as IF1 and IF2. For example, the device for providing website content information may be an IDC content information synchronization device, and the website information requesting devices may be an IDC service platform, a CDN resource bus and the like. The providing device exchanges information with these IDC website information requesting devices and offers them the established resource views of imported websites for download or query.

The CDN resource bus can implement functions such as resource management, content management and user scheduling. After obtaining the resource view of IDC-imported website content from the device for providing website content information, it schedules user access requests sensibly and distributes IDC website content to suitable CDN content nodes and service nodes according to the content distribution policy. The IDC service platform can implement hardware management and software management. Its software management can provide website content providers with basic content configuration and management functions, and can supply the IDC content information synchronization device with basic information such as domain names.
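To illustrate how a CDN resource bus might use the resource view when scheduling a request, the sketch below prefers the IDC copy whenever the requested URL appears in the IDC view and falls back to WebCache otherwise. The function name, the set-based inputs and the final `"origin"` fallback are assumptions for the example; the patent states only that IDC-imported content should be scheduled first.

```python
from typing import Set


def choose_source(url: str, idc_urls: Set[str], webcache_urls: Set[str]) -> str:
    """Pick where to send a user request for `url`."""
    if url in idc_urls:
        return "IDC"        # IDC-imported content takes priority
    if url in webcache_urls:
        return "WebCache"   # otherwise use the cached copy
    return "origin"         # assumed fallback: fetch from the source site
```

With an accurate, current set of IDC URLs, a request for content present in both places is no longer sent to WebCache by mistake.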

基于本发明实施例提供的上述网站内容信息提供方法,本发明实施例还提供一种网站内容信息提供装置,且结构如图3所示,包括:搜索模块10、索引模块20、视图资源生成模块30和接入检索模块40。Based on the above method for providing website content information provided by the embodiment of the present invention, the embodiment of the present invention also provides a device for providing website content information, and its structure is shown in Figure 3, including: a search module 10, an index module 20, and a view resource generation module 30 and access retrieval module 40.

搜索模块10,用于根据获得的引入网站的初始链接信息进行爬行搜索,获取到引入网站包括的链接信息,并获取链接信息的链接对象及其属性信息。The search module 10 is configured to crawl and search according to the obtained initial link information of the imported website, obtain the link information included in the imported website, and obtain the link object and its attribute information of the link information.

索引模块20,用于根据获取的链接对象的属性信息,建立获取的链接信息对应的链接对象索引。The indexing module 20 is configured to establish a link object index corresponding to the obtained link information according to the obtained attribute information of the link object.

视图资源生成模块30,用于根据各获取链接信息的链接对象索引之间的关联关系,建立各引入网站的网站资源视图;其中,网站资源视图中包括按设定规则排列的各引入网站的链接信息对应的链接对象索引。The view resource generation module 30 is used to establish the website resource view of each imported website according to the association relationship between the link object indexes for obtaining link information; wherein, the website resource view includes the links of each imported website arranged according to the set rules The link object index corresponding to the information.

接入检索模块40,用于根据建立的网站资源视图向网站信息请求方提供网站内容信息。The access retrieval module 40 is configured to provide website content information to the website information requester according to the established website resource view.

优选的,上述网站内容信息提供装置还包括:搜索策略管理模块50;其中:Preferably, the above-mentioned website content information providing device further includes: a search strategy management module 50; wherein:

搜索策略管理模块50,用于配置搜索策略,配置的搜索策略包括下列策略之一或组合:深度优先策略、广度优先策略和聚焦搜索策略。The search strategy management module 50 is configured to configure a search strategy, and the configured search strategy includes one or a combination of the following strategies: a depth-first strategy, a breadth-first strategy and a focused search strategy.

相应的,上述搜索模块10,具体用于根据获得的引入网站的初始链接信息和预先配置的搜索策略进行爬行搜索。Correspondingly, the above-mentioned search module 10 is specifically configured to perform crawling search according to the obtained initial link information of the imported website and the pre-configured search strategy.

优选的,上述索引模块20,具体用于根据获取的链接信息的链接对象及其属性信息构建各链接对象的内容索引,以及根据获取的链接信息的路径信息确定各链接信息之间的关联关系;建立起包括各链接信息关联关系以及各链接信息包括的链接对象的内容索引的链接对象索引。Preferably, the above-mentioned index module 20 is specifically configured to construct a content index of each link object according to the obtained link objects of the link information and their attribute information, and determine the association relationship between each link information according to the obtained path information of the link information; A link object index including the association relationship of each link information and the content index of the link objects included in each link information is established.

优选的,上述索引模块20还用于:在建立爬行搜索到的所述链接信息对应的链接对象索引之前,对爬行搜索到的链接信息,以及链接信息的链接对象及其属性信息进行数据去重处理。Preferably, the above-mentioned indexing module 20 is also used for: before establishing the link object index corresponding to the link information searched by crawling, data deduplication is performed on the link information searched by crawling, as well as the link objects of the link information and their attribute information deal with.
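The deduplication step can be sketched as follows. This is only an assumption about how duplicates might be detected: the patent does not specify a rule, so the example treats two records as duplicates when their URLs match after lowercasing the scheme and host, dropping the fragment, and defaulting an empty path to `/`. The names `normalize` and `deduplicate` are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Return a canonical form of `url` used as the deduplication key."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Drop the fragment: it never reaches the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def deduplicate(records):
    """Keep the first record seen for each normalized URL, preserving order."""
    seen = set()
    unique = []
    for rec in records:
        key = normalize(rec["url"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique
```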

优选的,上述接入检索模块40,具体用于根据网站信息请求设备发送的视图资源获取请求,将建立的网站资源视图提供给网站信息请求设备或根据视图资源获取请求中的配置要求对建立的网站资源视图进行配置调整后提供给网站信息请求设备,由网站信息请求设备根据提供的网站资源视图向网站信息请求方提供所请求的网站内容信息;或根据网站信息请求设备发送的视图资源查询请求,向网站信息请求设备开放查询接口,通过查询接口向网站信息请求设备提供建立的网站资源视图或提供根据视图资源查询请求中的配置要求对建立的网站资源视图进行配置调整后的网站资源视图,由网站信息请求设备根据查询到的网站资源视图向网站信息请求方提供所请求的网站内容信息。Preferably, the above-mentioned access retrieval module 40 is specifically configured to provide the established website resource view to the website information requesting device according to the view resource acquisition request sent by the website information requesting device, or to provide the established website resource view according to the configuration requirements in the view resource obtaining request. The website resource view is configured and adjusted and provided to the website information requesting device, and the website information requesting device provides the requested website content information to the website information requester according to the provided website resource view; or according to the view resource query request sent by the website information requesting device , open the query interface to the website information requesting device, provide the established website resource view to the website information requesting device through the query interface or provide the website resource view after configuring and adjusting the established website resource view according to the configuration requirements in the view resource query request, The website information requesting device provides the requested website content information to the website information requester according to the queried website resource view.

优选的,上述网站内容信息提供装置还包括本域控制模块60,用于控制搜索模块爬行搜索的搜索范围。Preferably, the above-mentioned website content information providing device further includes a local domain control module 60, configured to control the search scope of the crawling search by the search module.

The specific structure of the above website content information providing system is shown in Figure 4. The website content information providing device includes a search module 10, an index module 20, a view resource generation module 30, an access retrieval module 40, a search strategy management module 50, a local domain control module 60, a system management module 70 and a portal (Portal) module 80. Wherein:

The access retrieval module 40 of the website content information providing device handles communication with the website information requesting device, so that website content information can be provided to the website information requester according to the established website resource view. Based on interface protocols, the access retrieval module 40 exchanges data with external devices such as the CDN resource bus and the IDC service platform. Acting as a server, it provides authentication by verifying the account and password of each external device, and it implements both the import of the initial link information of IDC-imported websites and the sending of IDC website content resource information data. It is the technical implementation that shields the upper layers from the differences between underlying access methods.

The Portal module 80 provides a portal through which administrators manage, maintain and access the website content information providing device. The system is based on a B/S (browser/server) architecture and provides the web pages needed for operation, management and display of functions such as user login, log query and statistical reports. It is the technical implementation of the user interaction layer.

The search module 10 performs the crawl search of imported websites. Using the standard HTTP protocol, and according to the initial link information of the imported website supplied by the website information requesting device and the Portal module 80 and the search strategy set by the search strategy management module, it retrieves the content of the IDC-imported website within the local domain, traversing all link information within the local domain of that IDC website together with the corresponding link objects and their attribute information.

The index module 20 indexes the data related to the link information found by the crawl search and establishes the link object index. It parses the link information retrieved by the search module 10 together with the related data of link objects such as web pages and files and, after processing steps such as extraction, association, cleaning and deduplication, generates standard IDC content index data, yielding the link object index corresponding to each piece of link information.

Based on the IDC content index data generated by the index module 20, the view resource generation module 30 generates the website resource view of the imported website, for use when scheduling content for users.
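A website resource view arranged "according to a set rule" can be sketched as a sorted list of index entries. The example below is a minimal illustration, assuming each index entry is a dictionary with a `path` field and that the set rule is expressed as a sort key. The function name `build_resource_view` and the entry layout are hypothetical.

```python
def build_resource_view(site, index_entries, sort_key=lambda e: e["path"]):
    """Return a resource view for `site` with entries ordered by `sort_key`.

    The default rule orders entries by path; callers may pass another rule.
    """
    return {"site": site, "entries": sorted(index_entries, key=sort_key)}
```

Passing a different `sort_key`, for example one that orders by object size, yields a view arranged by a different rule without changing the index data.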

接入检索模块40还可以实现在索引模块20建立的IDC内容索引数据的基础上,提供检索功能,供Portal模块及网站信息请求设备查询IDC引入网站的内容。The access retrieval module 40 can also provide a retrieval function on the basis of the IDC content index data established by the index module 20 for the Portal module and the website information requesting device to query the content of the website imported by the IDC.

搜索策略管理模块50用于允许管理员配置和管理搜索策略,如深度优先搜索、广度优先搜索、聚焦搜索等规则,供搜索模块10调用。The search strategy management module 50 is used to allow administrators to configure and manage search strategies, such as rules such as depth-first search, breadth-first search, focused search, etc., for calling by the search module 10 .
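The difference between the depth-first and breadth-first strategies can be shown with a single crawl frontier that is consumed from different ends. This is a minimal sketch, assuming the site is given as an in-memory mapping from each URL to the URLs it links to (a stand-in for real HTTP fetching). The names `crawl_order` and `SITE` are hypothetical, and the focused search strategy, which would need a relevance score, is not covered.

```python
from collections import deque


def crawl_order(start, links, strategy="breadth_first"):
    """Return the order in which pages are visited starting from `start`."""
    if strategy not in ("breadth_first", "depth_first"):
        raise ValueError(f"unknown strategy: {strategy}")
    frontier = deque([start])
    seen = {start}
    order = []
    while frontier:
        # Breadth-first takes the oldest entry, depth-first the newest.
        url = frontier.popleft() if strategy == "breadth_first" else frontier.pop()
        order.append(url)
        for child in links.get(url, []):
            if child not in seen:
                seen.add(child)
                frontier.append(child)
    return order


# A small example site: the root links to two sections, each with one page.
SITE = {
    "/": ["/a", "/b"],
    "/a": ["/a/1"],
    "/b": ["/b/1"],
}
```

With `SITE`, the breadth-first order visits both sections before any page below them, while the depth-first order finishes one section completely before starting the other.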

The local domain control module 60 configures and manages the local domain search policy and controls the search scope of the search module 10, determining whether the search is confined to the imported website within the local domain or may also follow links to link objects in IDC machine rooms or servers of other domains.
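The scope check applied by the local domain control policy might look like the sketch below. It assumes that "local domain" means the host equals the configured domain or is one of its subdomains, which is one plausible reading and not something the patent specifies. The function name `in_scope` and its parameters are hypothetical.

```python
from urllib.parse import urlparse


def in_scope(url, home_domain, allow_other_domains=False):
    """Decide whether the crawler may follow `url` under the domain policy."""
    host = urlparse(url).hostname or ""
    # Same domain, or a subdomain of it.
    local = host == home_domain or host.endswith("." + home_domain)
    return local or allow_other_domains
```

The crawler would call `in_scope` on each discovered link before adding it to the frontier.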

The system management module 70 provides local network management functions. It is an optional module that monitors and manages system availability, device performance and network indicators in real time. For example, it: obtains the resource usage and health status of the website content information providing system in real time; collects the alarm information generated in the system in a unified way and invokes the corresponding policy according to the alarm level; connects to the upper-level network management system through the network management interface for data collection and transmission; records and analyzes the data produced by monitoring, records the operation logs of users of the system, and compiles statistics on the query records of upper-level and lower-level systems; automatically generates routine reports and various customized reports to support analysis for management needs; configures the configuration information of external network elements and presents it through the Portal; and provides tiered permission management, ensuring that users in different roles can use only the functions, and view and maintain only the data, for which they are authorized.

The above website content information providing device supports configuring, through the IDC service platform, the Portal interface or the CDN resource bus, the IDC website content resource subscription requirements of different website information requesting devices, and can generate different resource view files for different website information requesting devices. To keep the files secure and independent, the website content information providing device should store the website content resource view files for different website information requesting devices under different paths and control access to them through different user names and permissions.

上述网站内容信息提供装置各模块之间的交互流程如图5所示,具体包括如下步骤:The interaction process between the modules of the above-mentioned website content information providing device is shown in Figure 5, which specifically includes the following steps:

步骤S21:搜索模块向搜索策略管理模块请求搜索策略。Step S21: The search module requests a search policy from the search policy management module.

网站内容信息提供装置中的搜索模块向搜索策略管理模块请求爬虫的搜索策略。The search module in the website content information providing device requests the search strategy of the crawler from the search strategy management module.

步骤S22:搜索策略管理模块将配置的搜索策略返回给搜索模块。Step S22: The search strategy management module returns the configured search strategy to the search module.

例如:搜索策略管理模块向搜索模块返回爬虫搜索策略。For example: the search strategy management module returns the crawler search strategy to the search module.

步骤S23:搜索模块向本域控制模块请求本域控制策略。Step S23: The search module requests the local domain control policy from the local domain control module.

步骤S24:本域控制模块将配置的本域控制策略返回给搜索模块。Step S24: the local domain control module returns the configured local domain control policy to the search module.

如上面方法部分所述根据本域控制策略可以确定搜索模块爬行搜索的范围。As described in the method section above, the crawling search scope of the search module can be determined according to the domain control policy.

步骤S25:搜索模块按照配置的搜索策略和本域控制策略进行爬行搜索。Step S25: The search module crawls and searches according to the configured search strategy and the domain control strategy.

Following the configured search strategy, the crawler of the search module obtains, within the scope specified by the local domain control policy, the link information of the specified website, the corresponding link objects, and the attribute information of those link objects.

For the specific implementation, see step S11.

步骤S26:搜索模块向索引模块发送搜索到的各链接信息的链接对象及其属性信息等数据。Step S26: The search module sends data such as the link object and its attribute information of each link information searched to the index module.

步骤S27:索引模块对搜索模块搜索到的数据进行处理并生成获取的链接信息对应的链接对象索引。Step S27: The index module processes the data searched by the search module and generates a link object index corresponding to the obtained link information.

具体实现过程参见步骤S12。Refer to step S12 for the specific implementation process.

Step S28: The index module sends index information, such as the generated link object indexes, to the view resource generation module.

Step S29: The view resource generation module processes the index data to generate the website resource view.

具体实现过程参见步骤S13。Refer to step S13 for the specific implementation process.

上述描述了网站内容信息提供装置中各模块交互实现网站资源视图生成的过程。The foregoing describes the process of each module in the website content information providing device interacting to realize the generation of the website resource view.
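The interaction in steps S21 to S29 can be summarized as a small pipeline. The sketch below is only an illustration, assuming each module is represented by a plain callable passed in by the caller; the function `generate_view` and all parameter names are hypothetical.

```python
def generate_view(seed, get_strategy, get_domain_policy, crawl, build_index, build_view):
    """Run the module interaction of steps S21 to S29 for one seed link."""
    strategy = get_strategy()                 # S21-S22: fetch the search strategy
    policy = get_domain_policy()              # S23-S24: fetch the local domain policy
    records = crawl(seed, strategy, policy)   # S25: crawl within the allowed scope
    index = build_index(records)              # S26-S27: build link object indexes
    return build_view(index)                  # S28-S29: generate the resource view
```

Keeping the modules as injected callables mirrors the separation in the description: the search module only consumes the strategy and policy, and never configures them itself.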

The above website content information providing device mainly supports two data transfer modes for providing IDC-imported website content. The first is a File Transfer Protocol (FTP) service: the website information requesting device first initiates an acquisition request for the website resource view of a specific imported website, and the website content information providing device parses the content and then generates resource view information of the corresponding scope for the service platform to download. The second is interaction with the website information requesting device over Hypertext Transfer Protocol (HTTP) plus web services (WebService): the website information requesting device initiates a query request for the website resource view of a specific imported website, and the website content information providing device returns the corresponding resource view information. The implementation of the website content information providing method under these two data transfer modes is described below through specific embodiments:

实施例一Embodiment one

The website content information providing method of Embodiment 1 of the present invention provides the website resource view for download through a file interface. Its flow is shown in Figure 6 and includes the following steps:

步骤S101:网站信息请求设备向网站内容信息提供装置传送网站的初始链接信息。Step S101: The website information requesting device transmits the initial link information of the website to the website content information providing device.

例如:操作人员通过IDC业务平台或Portal界面向网站内容信息提供装置传送IDC网站的原始信息,包括域名、初始爬行链接等。For example: the operator transmits the original information of the IDC website, including the domain name, the initial crawling link, etc., to the website content information providing device through the IDC business platform or Portal interface.

步骤S102:网站内容信息提供装置通过IDC业务平台接口在IDC网站服务器上爬行搜索。Step S102: The website content information providing device crawls and searches on the IDC website server through the interface of the IDC service platform.

步骤S103:从IDC网站服务器获取初始链接信息对应的各链接信息,以及各链接信息对应的链接对象和链接对象的属性信息。Step S103: Obtain each link information corresponding to the initial link information from the IDC website server, as well as the link object and the attribute information of the link object corresponding to each link information.

步骤S104:网站内容信息提供装置基于爬行获得的各链接信息对应的链接对象和链接对象的属性信息,建立链接对象索引。Step S104: The website content information providing device builds a link object index based on the link objects corresponding to each link information obtained by crawling and the attribute information of the link objects.

网站内容信息提供装置通过数据处理操作建立引入网站包括的各链接信息的链接对象索引。The website content information providing device establishes a link object index of each link information included in the imported website through data processing operations.

步骤S105:网站内容信息提供装置生成标准IDC网站资源视图。Step S105: the website content information providing device generates a standard IDC website resource view.

步骤S106:网站信息请求设备发送视图资源获取请求给网站内容信息提供装置。Step S106: the website information requesting device sends a view resource acquisition request to the website content information providing apparatus.

网站信息请求设备通过与网站内容信息提供装置的接口上传视图资源的配置要求及下载网站资源视图。如图6所示,网站信息请求设备可以是IDC业务平台或CDN资源总线。The website information requesting device uploads configuration requirements of view resources and downloads website resource views through the interface with the website content information providing device. As shown in FIG. 6, the website information requesting device may be an IDC service platform or a CDN resource bus.

步骤S107:网站内容信息提供装置根据视图资源获取请求中的配置要求对建立的网站资源视图进行配置调整。Step S107: The website content information providing device adjusts the configuration of the established website resource view according to the configuration requirements in the view resource acquisition request.

该步骤为可选步骤,当视图资源获取请求中不携带配置要求时,不执行该步骤。当视图资源获取请求中携带配置要求时,网站内容信息提供装置按照网站信息请求设备的配置要求,输出符合配置要求的IDC网站资源视图文件并存储于对应的路径下,供网站信息请求设备下载。This step is optional, and is not performed when the view resource acquisition request does not carry configuration requirements. When the configuration requirements are carried in the view resource acquisition request, the website content information providing device outputs the IDC website resource view files meeting the configuration requirements according to the configuration requirements of the website information requesting device and stores them in the corresponding path for downloading by the website information requesting device.
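The optional configuration adjustment can be sketched as a filter over the established view. The patent does not define what a configuration requirement contains, so the example assumes a hypothetical requirement of the form `{"types": [...]}` that restricts the view to certain link object types. The function name `adjust_view` and the `type` field are likewise assumptions.

```python
def adjust_view(view, requirements=None):
    """Return the view unchanged, or a filtered copy if requirements are given."""
    if not requirements:
        # The request carries no configuration requirements: skip the step.
        return view
    wanted = set(requirements.get("types", []))
    entries = [e for e in view["entries"] if not wanted or e["type"] in wanted]
    return {"site": view["site"], "entries": entries}
```

The adjusted view is built as a new object, so the established view stays available for other requesting devices with different requirements.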

步骤S108:网站信息请求设备从网站内容信息提供装置下载请求获取的网站资源视图。Step S108: The website information requesting device downloads the requested website resource view from the website content information providing device.

网站信息请求设备根据自身需求与网站内容信息提供装置建立连接,从网站内容信息提供装置下载网站资源视图文件。The website information requesting device establishes a connection with the website content information providing device according to its own needs, and downloads the website resource view file from the website content information providing device.

实施例二Embodiment two

The website content information providing method of Embodiment 2 of the present invention provides the website resource view for query through a real-time query interface. Its flow is shown in Figure 7 and includes the following steps:

步骤S201:网站信息请求设备向网站内容信息提供装置传送网站的初始链接信息。Step S201: The website information requesting device transmits the initial link information of the website to the website content information providing device.

网站内容信息提供装置对外提供WebService或者其它的实时消息接口。操作人员通过IDC业务平台或Portal界面向网站内容信息提供装置传送IDC网站的原始信息,包括域名、初始爬行链接等。The website content information providing device provides WebService or other real-time message interfaces to the outside. The operator transmits the original information of the IDC website, including domain name, initial crawl link, etc., to the website content information providing device through the IDC business platform or Portal interface.

步骤S202:网站内容信息提供装置通过IDC业务平台接口在IDC网站服务器上爬行搜索。Step S202: The website content information providing device crawls and searches on the IDC website server through the interface of the IDC service platform.

步骤S203:从IDC网站服务器获取初始链接信息对应的各链接信息,以及各链接信息对应的链接对象和链接对象的属性信息。Step S203: Obtain each link information corresponding to the initial link information from the IDC website server, as well as the link object and the attribute information of the link object corresponding to each link information.

步骤S204:网站内容信息提供装置基于爬行获得的各链接信息对应的链接对象和链接对象的属性信息,建立链接对象索引。Step S204: The website content information providing device establishes a link object index based on the link objects corresponding to each link information obtained by crawling and the attribute information of the link objects.

网站内容信息提供装置通过数据处理操作建立引入网站包括的各链接信息的链接对象索引。The website content information providing device establishes a link object index of each link information included in the imported website through data processing operations.

步骤S205:网站内容信息提供装置生成标准IDC网站资源视图。Step S205: the website content information providing device generates a standard IDC website resource view.

步骤S206:网站信息请求设备请求登陆网站内容信息提供装置。Step S206: the website information requesting device requests to log in to the website content information providing device.

网站信息请求设备需要获取网站资源视图时,向网站内容信息提供装置发出登陆请求。When the website information requesting device needs to obtain the website resource view, it sends a login request to the website content information providing device.

步骤S207:网站内容信息提供装置响应网站信息请求设备的登陆请求。Step S207: The website content information providing device responds to the login request of the website information requesting device.

可选的,网站内容信息提供装置可以在对网站信息请求设备进行鉴权后再允许业务平台登陆。Optionally, the website content information providing device may allow the service platform to log in after authenticating the website information requesting device.

Step S208: The website information requesting device sends a view resource query request to the website content information providing device.

网站信息请求设备通过与网站内容信息提供装置的接口上传视图资源的配置要求及查询网站资源视图。如图7所示,网站信息请求设备可以是IDC业务平台或CDN资源总线。The website information requesting device uploads configuration requirements of view resources and queries website resource views through the interface with the website content information providing device. As shown in FIG. 7 , the website information requesting device may be an IDC service platform or a CDN resource bus.

步骤S209:网站内容信息提供装置根据视图资源查询请求中的配置要求对建立的网站资源视图进行配置调整。Step S209: The website content information providing device adjusts the configuration of the established website resource view according to the configuration requirements in the view resource query request.

该步骤为可选步骤,当视图资源查询请求中不携带配置要求时,不执行该步骤。当视图资源查询请求中携带配置要求时,网站内容信息提供装置按照网站信息请求设备的配置要求,输出符合配置要求的IDC网站资源视图并存储于对应的路径下,供网站信息请求设备查询。This step is optional, and is not performed when the view resource query request does not carry configuration requirements. When the configuration requirement is carried in the view resource query request, the website content information providing device outputs the IDC website resource view that meets the configuration requirements according to the configuration requirements of the website information requesting device and stores it in a corresponding path for the website information requesting device to query.

步骤S210:网站内容信息提供装置响应网站信息请求设备的视图资源查询请求。Step S210: the website content information providing device responds to the view resource query request of the website information requesting device.

网站信息请求设备根据自身需求与网站内容信息提供装置建立连接,从网站内容信息提供装置查询网站资源视图。The website information requesting device establishes a connection with the website content information providing device according to its own needs, and queries the website resource view from the website content information providing device.

Step S211: The website information requesting device sends a logout request to the website content information providing device.

网站信息请求设备不需要再获取网站资源视图时,向网站内容信息提供装置发出登出请求。When the website information requesting device no longer needs to obtain the website resource view, it sends a logout request to the website content information providing device.

步骤S212:网站内容信息提供装置响应网站信息请求设备的登出请求。Step S212: The website content information providing device responds to the logout request of the website information requesting device.

网站内容信息提供装置注销网站信息请求设备的登陆信息。The website content information providing device logs out the login information of the website information requesting device.
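The login, query and logout exchange of steps S206 to S212 can be sketched as a small session-based service. This is an illustration under stated assumptions: the class `ViewQueryService`, its methods and the token-based session are hypothetical, and the plain-text password comparison is for demonstration only and is not how a production system should store credentials.

```python
import secrets


class ViewQueryService:
    """Toy model of the real-time query interface (login, query, logout)."""

    def __init__(self, accounts, views):
        self._accounts = accounts   # account -> password (illustration only)
        self._views = views         # site -> website resource view
        self._sessions = set()      # tokens of logged-in devices

    def login(self, account, password):
        # S206-S207: authenticate the requesting device, then open a session.
        if account not in self._accounts or self._accounts[account] != password:
            raise PermissionError("authentication failed")
        token = secrets.token_hex(16)
        self._sessions.add(token)
        return token

    def query(self, token, site):
        # S208-S210: only logged-in devices may query a view.
        if token not in self._sessions:
            raise PermissionError("not logged in")
        return self._views[site]

    def logout(self, token):
        # S211-S212: remove the device's login information.
        self._sessions.discard(token)
```

After `logout`, the same token no longer grants access, which corresponds to the device removing the login information of the requesting device in step S212.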

The method and device for providing website content information provided by the embodiments of the present invention can automatically access, collect and obtain website content information from IDC website servers over HTTP, and can control the range of URLs the crawler fetches so that only the resource information of websites already imported into a specific IDC domain is obtained. For the acquired URL information, they support processing such as URL association and deduplication, generating link object index information down to the level of individual link objects. According to the needs of different website information requesting devices, they can flexibly generate different website resource view information to provide to those devices. Two delivery modes are supported: in the file mode, website resource view files are generated on demand and provided to the website information requesting device; in the message-based real-time query mode, the website information requesting device interacts with the website content information providing device through an interface and actively initiates an IDC website resource view query request, and the website content information providing device returns the queried website resource view.

上述方法有效解决现阶段通过人工手工方式配置或采集IDC网站信息所引发的信息同步效率低下、准确性差、速度慢,同步不及时的缺陷,具有自动收集整合处理、效率高、实时性强的优点,可进一步优化IDC网站信息提供的及时性和准确率,以加强CDN网络对网站资源智能调度的能力。The above method effectively solves the defects of low information synchronization efficiency, poor accuracy, slow speed, and untimely synchronization caused by manual configuration or collection of IDC website information at this stage, and has the advantages of automatic collection and integration processing, high efficiency, and strong real-time performance , can further optimize the timeliness and accuracy of IDC website information, so as to strengthen the ability of CDN network to intelligently dispatch website resources.

The above method concentrates the processing of IDC website resource information in the newly added website content information providing device, so that the website information requesting devices no longer each have to integrate and process IDC website information. This effectively reduces the complexity and functional requirements of implementing IDC website content management, lowers the construction and investment costs of service-side equipment, and provides a good solution for website information requesting devices to obtain IDC website resource information quickly and efficiently.

The foregoing description shows and describes preferred embodiments of the present invention. As stated above, however, it should be understood that the invention is not limited to the forms disclosed herein; these should not be regarded as excluding other embodiments, and the invention can be used in various other combinations, modifications and environments and can be altered, within the scope of the inventive concept described herein, through the above teachings or the skill or knowledge of the relevant field. Alterations and variations made by those skilled in the art that do not depart from the spirit and scope of the present invention shall all fall within the protection scope of the appended claims.

Claims (11)

1. A method for providing website content information, characterized by comprising:
performing a crawl search according to obtained initial link information of an imported website, acquiring the link information included in the imported website, and acquiring the link objects of the link information and their attribute information;
establishing a link object index corresponding to the link information according to the acquired link objects of the link information and their attribute information;
establishing a website resource view for each imported website according to the association relationships between the link object indexes of the link information, the website resource view including the link object indexes corresponding to the link information of each imported website, arranged according to a set rule;
providing website content information to a website information requester according to the established website resource view.

2. The method according to claim 1, characterized in that performing the crawl search according to the obtained initial link information of the imported website specifically comprises:
performing the crawl search according to the obtained initial link information of the imported website and a pre-configured search strategy, wherein the search strategy includes one or a combination of the following strategies: a depth-first strategy, a breadth-first strategy and a focused search strategy.

3. The method according to claim 1, characterized in that establishing the link object index corresponding to the link information according to the acquired link objects of the link information and their attribute information specifically comprises:
constructing a content index of each link object according to the link objects corresponding to the acquired link information and their attribute information, and determining the association relationships between the pieces of link information according to the path information of the link information; and establishing a link object index that includes the association relationships of the link information and the content indexes of the link objects corresponding to each piece of link information.

4. The method according to claim 1, characterized in that providing website content information to the website information requester according to the established website resource view specifically comprises:
according to a view resource acquisition request sent by a website information requesting device, providing the established website resource view to the website information requesting device, or providing it to the website information requesting device after adjusting its configuration according to the configuration requirements in the view resource acquisition request, the website information requesting device then providing the requested website content information to the website information requester according to the provided website resource view; or
according to a view resource query request sent by a website information requesting device, opening a query interface to the website information requesting device and, through the query interface, providing the website information requesting device with the established website resource view or with the website resource view after its configuration has been adjusted according to the configuration requirements in the view resource query request, the website information requesting device then providing the requested website content information site management service to the website information requester according to the queried website resource view.

5. A device for providing website content information, characterized by comprising:
a search module, configured to perform a crawl search according to obtained initial link information of an imported website, acquire the link information included in the imported website, and acquire the link objects of the link information and their attribute information;
an index module, configured to establish a link object index corresponding to the link information according to the acquired link objects and their attribute information;
a view resource generation module, configured to establish a website resource view for each imported website according to the association relationships between the link object indexes of the link information, the website resource view including the link object indexes corresponding to the link information of each imported website, arranged according to a set rule;
an access retrieval module, configured to provide website content information to a website information requester according to the established website resource view.

6. The device according to claim 5, characterized by further comprising a search strategy management module;
the search strategy management module being configured to configure a search strategy, the search strategy including but not limited to one or a combination of the following strategies: a depth-first strategy, a breadth-first strategy and a focused search strategy;
the search module being specifically configured to perform the crawl search according to the obtained initial link information of the imported website and the pre-configured search strategy.

7. The device according to claim 5, characterized in that the index module is specifically configured to:
construct a content index of each link object according to the link objects included in the acquired link information and their attribute information, and determine the association relationships between the pieces of link information according to the path information of the link information; and establish a link object index that includes the association relationships of the link information and the content indexes of the link objects corresponding to each piece of link information.

8. The device according to claim 5, characterized in that the access retrieval module is specifically configured to:
according to a view resource acquisition request sent by a website information requesting device, provide the established website resource view to the website information requesting device, or provide it to the website information requesting device after adjusting its configuration according to the configuration requirements in the view resource acquisition request, the website information requesting device then providing the requested website content information to the website information requester according to the provided website resource view; or
according to a view resource query request sent by a website information requesting device, open a query interface to the website information requesting device and, through the query interface, provide the website information requesting device with the established website resource view or with the website resource view after its configuration has been adjusted according to the configuration requirements in the view resource query request, the website information requesting device then providing the requested website content information to the website information requester according to the queried website resource view.

9. The device according to any one of claims 5-8, characterized in that the index module is further configured to:
The device according to any one of claims 5-8, wherein the index module is further used for: 在建立爬行搜索到的所述链接信息对应的链接对象索引之前,对爬行搜索到的链接信息,以及链接信息的链接对象及其属性信息进行数据去重处理。Before establishing the link object index corresponding to the link information searched by crawling, data deduplication processing is performed on the link information searched by crawling, as well as the link objects of the link information and their attribute information. 10.如权利要求5-8任一所述的装置,其特征在于,还包括:10. The device according to any one of claims 5-8, further comprising: 本域控制模块,用于控制搜索模块爬行搜索的搜索范围。This domain control module is used to control the search scope of the crawling search of the search module. 11.一种网站内容信息提供系统,其特征在于,包括如权利要求5-10任一所述的网站内容信息提供装置和至少一个网站信息请求设备。11. A system for providing website content information, characterized by comprising the apparatus for providing website content information according to any one of claims 5-10 and at least one website information requesting device.
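The pipeline recited in the device claims (crawl from an initial link under a configurable strategy, restrict the search scope to the site's own domain, deduplicate, index each link object, derive associations from path information, and arrange the index into a resource view) can be illustrated with a minimal Python sketch. This is only an illustration under stated assumptions, not the applicant's implementation: the in-memory `SITE` table stands in for real page fetching, and every name here (`crawl`, `parent_of`, `build_index`, `build_view`, the attribute keys, the example URLs) is hypothetical. Only the depth-first and breadth-first strategies are shown; the "set rule" for the view is assumed to be a simple sort by URL.

```python
from collections import deque
from urllib.parse import urlparse

# Hypothetical in-memory "website": URL -> (attributes, outgoing links).
# A real crawler would fetch and parse pages instead.
SITE = {
    "http://example.com/": (
        {"type": "page", "title": "Home"},
        ["http://example.com/news/", "http://example.com/img/logo.png",
         "http://other.org/"]),
    "http://example.com/news/": (
        {"type": "page", "title": "News"},
        ["http://example.com/news/a.html", "http://example.com/"]),
    "http://example.com/news/a.html": ({"type": "page", "title": "Article A"}, []),
    "http://example.com/img/logo.png": ({"type": "image", "title": "Logo"}, []),
    "http://other.org/": ({"type": "page", "title": "Elsewhere"}, []),
}


def crawl(seed, strategy="breadth", fetch=SITE.get):
    """Crawl from the seed link; return {url: attributes} for in-domain links."""
    domain = urlparse(seed).netloc  # search scope: the seed's own domain
    frontier, seen, found = deque([seed]), {seed}, {}
    while frontier:
        # Breadth-first takes the oldest link, depth-first the newest.
        url = frontier.popleft() if strategy == "breadth" else frontier.pop()
        record = fetch(url)
        if record is None:
            continue
        attrs, links = record
        found[url] = attrs
        for link in links:
            # Deduplicate links and stay inside the seed's domain.
            if link not in seen and urlparse(link).netloc == domain:
                seen.add(link)
                frontier.append(link)
    return found


def parent_of(url):
    """Derive the parent link purely from the URL path (None for the root)."""
    parts = urlparse(url)
    path = parts.path.rstrip("/")
    if not path:
        return None
    return f"{parts.scheme}://{parts.netloc}{path.rsplit('/', 1)[0]}/"


def build_index(found):
    """Link object index: content index plus path-derived association."""
    return {url: {"content": attrs, "parent": parent_of(url)}
            for url, attrs in found.items()}


def build_view(index):
    """Arrange index entries by a set rule (assumed here: sorted by URL)."""
    return [(url, index[url]["content"]["title"], index[url]["parent"])
            for url in sorted(index)]


if __name__ == "__main__":
    for row in build_view(build_index(crawl("http://example.com/"))):
        print(row)
```

Because the parent is derived from the path alone, it may name a location that was never crawled (for example the `/img/` directory above); a fuller implementation would have to decide how to represent such intermediate nodes in the view.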
CN2011103626460A 2011-11-16 2011-11-16 Providing method, system and device of website content information Pending CN103116580A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2011103626460A CN103116580A (en) 2011-11-16 2011-11-16 Providing method, system and device of website content information

Publications (1)

Publication Number Publication Date
CN103116580A true CN103116580A (en) 2013-05-22

Family

ID=48414957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2011103626460A Pending CN103116580A (en) 2011-11-16 2011-11-16 Providing method, system and device of website content information

Country Status (1)

Country Link
CN (1) CN103116580A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6038610A (en) * 1996-07-17 2000-03-14 Microsoft Corporation Storage of sitemaps at server sites for holding information regarding content
US7296222B1 (en) * 1999-04-16 2007-11-13 International Business Machines Corporation Method and system for preparing and displaying page structures for web sites
US7599920B1 (en) * 2006-10-12 2009-10-06 Google Inc. System and method for enabling website owners to manage crawl rate in a website indexing system
CN102124460A (en) * 2008-04-04 2011-07-13 微软公司 Standard schema and user interface for website maps

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103532944A (en) * 2013-10-08 2014-01-22 百度在线网络技术(北京)有限公司 Method and device for capturing unknown attack
CN103532944B (en) * 2013-10-08 2016-09-07 百度在线网络技术(北京)有限公司 A kind of method and apparatus capturing unknown attack
CN104484424A (en) * 2014-12-19 2015-04-01 浪潮通用软件有限公司 Establishing method for resource price information base of construction enterprise based on internet
CN105183919A (en) * 2015-10-13 2015-12-23 郑州悉知信息科技股份有限公司 Deployment method and device for internal links of website
CN105183919B (en) * 2015-10-13 2018-10-12 郑州悉知信息科技股份有限公司 The dispositions method and device of chain in a kind of website
CN106168977A (en) * 2016-07-15 2016-11-30 河南山谷网安科技股份有限公司 A kind of column recognition methods for web portal security monitoring
CN106168977B (en) * 2016-07-15 2019-07-02 山谷网安科技股份有限公司 A kind of column recognition methods for web portal security monitoring
CN109542402A (en) * 2018-10-12 2019-03-29 杭州工跃机械制造有限公司 A method of adaptive is used for more portal website's seamless switchings
CN110008390A (en) * 2019-02-27 2019-07-12 深圳壹账通智能科技有限公司 Appraisal procedure, device, computer equipment and the storage medium of application program
CN113660178A (en) * 2021-06-30 2021-11-16 新浪网技术(中国)有限公司 CDN content management system

Similar Documents

Publication Publication Date Title
CN103116580A (en) Providing method, system and device of website content information
US9894049B2 (en) Network aggregator
KR100514149B1 (en) A method for searching and analysing information in data networks
CN103812882B (en) A kind of method and system of file transmission
CN103034483B (en) web page script management method and system
US8706756B2 (en) Method, system and apparatus of hybrid federated search
WO2017167050A1 (en) Configuration information generation and transmission method, and resource loading method, apparatus and system
US10771353B2 (en) Policy enforcement as a service for third party platforms with asynchronous user tracking mechanisms
TWI557571B (en) Method, computer-storage media, and server for optimizing web crawling with user history
US20180287920A1 (en) Intercepting application traffic monitor and analyzer
CN107135119B (en) Business response tracking and interface state monitoring development system
WO2016082289A1 (en) Content distribution network (cdn)-based website acceleration method and system
CN102739811B (en) The method and apparatus of domain name mapping
CN102663049B (en) A kind of renewal search engine URL library method and device
US20140096237A1 (en) Information processing system, access right management method, information processing apparatus and control method and control program therefor
CN117321589A (en) Web scraping and its applications by using proxies
CN104683313B (en) Multimedia service processing device, method and system
KR20110039513A (en) Content Distribution Management System and Method
CN104935653A (en) A bypass cache method and device for accessing hot resources
KR101780802B1 (en) Method and apparatus for managing device context by using ip address in communication system
CN103729380A (en) Data processing method, system and device
US9055113B2 (en) Method and system for monitoring flows in network traffic
US20040267929A1 (en) Method, system and computer program products for adaptive web-site access blocking
KR20170088950A (en) Method and apparatus for providing website authentication data for search engine
WO2015123990A1 (en) Page push method, device, server and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20130522