CN101984429B - Method and device for acquiring destination page, search engine and browser - Google Patents
- Publication number: CN101984429B (also listed as CN2010105314609A, CN201010531460A)
- Authority: CN (China)
- Prior art keywords: page, path, dom, target page, state path
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Transfer Between Computers (AREA)
Description
Technical Field
The invention relates to Internet technology, and in particular to a method, a device, a search engine and a browser for acquiring a target page.
Background
With the rapid growth of the network, the Internet has become the carrier of a vast amount of information, and extracting and using that information effectively is a major challenge. Search engines, as tools that help people retrieve information, have become users' entrance and guide to the Internet. A web crawler (spider) is a program that automatically fetches web pages and is an important component of a search engine.
A traditional web crawler starts from the uniform resource locators (URLs) of one or several initial web pages, fetches the base page at each URL, and parses the content of the current base page to obtain the URLs of target pages. It then processes the data, which includes building page summaries, snapshots and indexes and storing them, and returns the results to the browser for the user to choose from.
However, when obtaining target page URLs, a traditional web crawler can only fetch static pages. As Internet technology has developed, page content has shifted from static generation to dynamic generation, and traditional crawler technology clearly cannot keep up with this shift: it cannot fetch the dynamic content of a page.
Summary of the Invention
The invention provides a method, a device, a search engine and a browser for acquiring a target page, so that a search engine can fetch the dynamic content of a page when searching for target pages.
The specific technical solution is as follows:
A method for acquiring a target page, the method comprising the following steps:
A. Fetch the base page corresponding to a received uniform resource locator (URL), together with the scripts of that base page;
B. Analyze the DOM nodes downloaded from the fetched base page, determine whether the script corresponding to a DOM event in a DOM node produces dynamic information, generate from the analysis result one or more state paths that correspond to the base page and contain the dynamic information, and fetch target pages using the generated state paths. Each state path contains: the URL of the base page, the location information of the Document Object Model (DOM) event that produces dynamic information in the base page, and the index of the callback function corresponding to that DOM event.
Step B specifically includes:
Downloading each DOM node while the base page and its scripts are being fetched, and performing steps B11 to B13 on each downloaded DOM node in turn; once all DOM nodes have been downloaded, performing step B14;
B11. Determine whether the currently downloaded DOM node is a script tag; if so, return to step B11 for the next downloaded DOM node; otherwise, perform step B12;
B12. Determine whether the currently downloaded DOM node contains a DOM event and a callback function; if not, return to step B11 for the next downloaded DOM node; if so, perform step B13;
B13. Generate a state path from the DOM event contained in the currently downloaded DOM node, save it in the state path queue, and return to step B11 for the next downloaded DOM node;
B14. Obtain, one by one, the target page corresponding to each state path in the state path queue, determine whether new page content is produced or a page jump occurs, and confirm the state paths that produce new page content or a page jump as the state paths corresponding to the base page.
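Steps B11 to B14 can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes that DOM events appear as inline `on*` attributes, it builds a simplified tag-only path rather than a real XPath, and the names `EventNodeCollector`, `confirm_state_paths` and the caller-supplied `fetch_target` function (which would execute the callback and return the resulting URL and content) are hypothetical.

```python
from collections import deque
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay on the path stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class EventNodeCollector(HTMLParser):
    """Queues one candidate state path per inline event handler (B11 to B13)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.queue = deque()   # the state path queue of step B13
        self._stack = []       # currently open tags, used for a simplified path

    def handle_starttag(self, tag, attrs):
        path = self._stack + [tag]
        if tag not in VOID_TAGS:
            self._stack.append(tag)
        if tag == "script":                        # B11: skip script tags
            return
        attr_map = dict(attrs)
        for name, value in attrs:
            if name.startswith("on") and value:    # B12: event plus callback
                self.queue.append({                # B13: build and enqueue
                    "base_url": self.base_url,
                    "id": attr_map.get("id"),
                    "xpath": "/".join(path),
                    "event": name[2:],
                    "callback": value,
                })

    def handle_endtag(self, tag):
        if tag in self._stack:
            # Pop up to and including the matching open tag.
            while self._stack.pop() != tag:
                pass


def confirm_state_paths(queue, base_page, fetch_target):
    """B14: keep only the state paths that change the URL or the content."""
    confirmed = []
    while queue:
        path = queue.popleft()
        url, content = fetch_target(path)   # hypothetical: runs the callback
        if url != base_page["url"] or content != base_page["content"]:
            confirmed.append(path)
    return confirmed
```

In this sketch the queue decouples discovery from verification, so every candidate is collected during the download and checked only after the last DOM node has arrived.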
Alternatively, step B specifically includes:
Downloading each DOM node while the base page and its scripts are being fetched, and performing steps B21 to B24 on each downloaded DOM node in turn, until all DOM nodes have been downloaded;
B21. Determine whether the currently downloaded DOM node is a script tag; if so, return to step B21 for the next downloaded DOM node; otherwise, perform step B22;
B22. Determine whether the currently downloaded DOM node contains a DOM event and a callback function; if not, return to step B21 for the next downloaded DOM node; if so, perform step B23;
B23. Generate a state path from the DOM event contained in the currently downloaded DOM node;
B24. Obtain the target page corresponding to that state path and determine whether new page content is produced or a page jump occurs; if so, confirm the state path as a state path corresponding to the base page. In either case, return to step B21 for the next downloaded DOM node.
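The second variant verifies each state path as soon as it is generated instead of queuing it. The following Python sketch assumes that the DOM nodes have already been parsed into dictionaries with `tag`, `id`, `xpath` and `events` keys; that representation, the function name `analyze_streaming` and the caller-supplied `fetch_target` function are illustrative assumptions, not part of the source.

```python
def analyze_streaming(nodes, base_page, fetch_target):
    """Walks DOM nodes in download order and verifies each state path at once.

    nodes: iterable of dicts with 'tag', 'id', 'xpath' and 'events'
           (a mapping of event name to callback index); hypothetical format.
    fetch_target: hypothetical function that runs the callback of a state
           path and returns a (url, content) pair for the resulting page.
    """
    confirmed = []
    for node in nodes:
        if node["tag"] == "script":              # B21: skip script tags
            continue
        if not node.get("events"):               # B22: no event, no callback
            continue
        for event, callback in node["events"].items():
            path = {                             # B23: generate the state path
                "base_url": base_page["url"],
                "id": node.get("id"),
                "xpath": node["xpath"],
                "event": event,
                "callback": callback,
            }
            url, content = fetch_target(path)    # B24: obtain the target page
            if url != base_page["url"] or content != base_page["content"]:
                confirmed.append(path)
    return confirmed
```

Compared with the queue-based variant, this one needs no intermediate queue, at the cost of fetching target pages while the base page is still being downloaded.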
In the above approaches, determining whether a page jump occurs includes: if the URL of the obtained target page differs from the URL of the base page, a page jump is determined to have occurred.
Specifically, determining whether new page content is produced includes: comparing the obtained target page with the base page by sentence signature or by string comparison, and determining that new page content is produced if the comparison shows that the two pages have different content; or,
computing the similarity between the obtained target page and the base page, and determining that new page content is produced if the result shows that the two pages have different content.
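The source does not specify how sentence signatures or similarity are computed, so the Python sketch below makes its own assumptions: sentences are split on common punctuation, a signature is the SHA-256 digest of a sentence, similarity is the ratio from the standard library's `difflib.SequenceMatcher`, and the 0.95 threshold is an arbitrary illustrative value. All function names are hypothetical.

```python
import hashlib
import re
from difflib import SequenceMatcher


def is_page_jump(base_url, target_url):
    """A page jump is assumed whenever the two URLs differ."""
    return base_url != target_url


def sentence_signatures(text):
    """Returns one digest per non-empty sentence (splitting rule is assumed)."""
    sentences = [s.strip() for s in re.split(r"[.!?。！？\n]+", text)]
    return {hashlib.sha256(s.encode("utf-8")).hexdigest()
            for s in sentences if s}


def has_new_content_by_signature(base_text, target_text):
    """New content: the target has at least one sentence the base lacks."""
    return bool(sentence_signatures(target_text) - sentence_signatures(base_text))


def has_new_content_by_similarity(base_text, target_text, threshold=0.95):
    """New content: similarity falls below an assumed threshold."""
    ratio = SequenceMatcher(None, base_text, target_text).ratio()
    return ratio < threshold
```

Signature comparison is cheap and insensitive to reordering of sentences, while the similarity ratio tolerates small edits within a sentence; which one suits a given crawler is a design choice the source leaves open.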
The location information of the DOM event includes: a DOM node identifier, the path (XPath) of the DOM node, and a DOM event identifier.
Further, after step B, the method also includes:
C. Store the state paths corresponding to the base page generated in step B and the snapshots of the fetched target pages, and build and store an index of the target pages.
A method for acquiring a target page, performed after the above method, includes:
After receiving a search request from a browser, match the keywords contained in the search request against the stored index of target pages, and return to the browser a search result that contains the state paths corresponding to the matched target pages, so that the browser can use the state path selected by the user to obtain the corresponding target page.
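The matching step can be pictured with a toy inverted index in Python. The source does not describe the index structure, so the whitespace tokenization, the requirement that all keywords match, and the class name `TargetPageIndex` are assumptions made only for illustration.

```python
from collections import defaultdict


class TargetPageIndex:
    """Toy inverted index mapping keywords to the state paths of target pages."""

    def __init__(self):
        self._postings = defaultdict(set)   # word -> set of page ids
        self._state_paths = {}              # page id -> state path

    def add(self, page_id, text, state_path):
        """Indexes the text of one target page under its state path."""
        self._state_paths[page_id] = state_path
        for word in text.lower().split():
            self._postings[word].add(page_id)

    def search(self, query):
        """Returns the state paths of pages that contain every query word."""
        words = query.lower().split()
        if not words:
            return []
        matches = set.intersection(
            *(self._postings.get(word, set()) for word in words))
        return [self._state_paths[page_id] for page_id in sorted(matches)]
```

The search result carries state paths rather than plain URLs, which is what lets the browser reproduce the dynamic state of the target page later.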
In addition, the search result may also include snapshot information of the matched target pages;
after receiving from the browser the snapshot information of the target page selected by the user, the corresponding target page snapshot is returned to the browser.
Further, after the state paths corresponding to the matched target pages have been returned to the browser in the search result, the method also includes:
After receiving the user-selected state path sent by the browser, send a target page request to the target page site according to that state path, so that the target page site pushes the target page to the browser.
A method for acquiring a target page, the method comprising:
After sending a search request to a search engine, the browser receives a search result containing state paths returned by the search engine;
sends a target page request to the target page site according to the state path selected by the user;
receives the target page pushed by the target page site;
wherein the search result containing state paths is returned by the search engine using the method of claim 8.
A device for acquiring a target page, the device comprising:
a first fetching unit, configured to fetch the base page corresponding to a received uniform resource locator (URL), together with the scripts of that base page;
an analysis unit, configured to analyze the DOM nodes downloaded from the base page fetched by the first fetching unit, determine whether the script corresponding to a DOM event in a DOM node produces dynamic information, and generate from the analysis result one or more state paths that correspond to the base page and contain the dynamic information; each state path contains: the URL of the base page, the location information of the Document Object Model (DOM) event that produces dynamic information in the base page, and the index of the callback function corresponding to that DOM event;
a second fetching unit, configured to fetch target pages using the state paths generated by the analysis unit.
The analysis unit specifically includes: a first judging module, a second judging module, a first path generating module and a first path determining module;
the first fetching unit downloads each DOM node while fetching the base page and its scripts and sends the currently downloaded DOM node to the first judging module; once all DOM nodes have been downloaded, it sends a determination notification to the first path determining module;
the first judging module is configured to determine whether the currently downloaded DOM node is a script tag; if so, it triggers the first fetching unit to download the next DOM node; otherwise, it sends a judgment notification to the second judging module;
the second judging module is configured to determine whether the currently downloaded DOM node contains a DOM event and a callback function; if not, it triggers the first fetching unit to download the next DOM node; if so, it sends an execution notification to the first path generating module;
the first path generating module is configured to, after receiving the execution notification, generate a state path from the currently downloaded DOM node, save the generated state path in the state path queue, and trigger the first fetching unit to download the next DOM node;
the first path determining module is configured to, on receiving the determination notification, trigger the second fetching unit to obtain, one by one, the target page corresponding to each state path in the state path queue, determine from the results whether new page content is produced or a page jump occurs, and confirm the state paths that produce new page content or a page jump as the state paths corresponding to the base page.
Alternatively, the analysis unit may include: a third judging module, a fourth judging module, a second path generating module and a second path determining module;
the first fetching unit downloads each DOM node while fetching the base page and its scripts and sends the currently downloaded DOM node to the third judging module, until all DOM nodes have been downloaded;
the third judging module is configured to determine whether the currently downloaded DOM node is a script tag; if so, it triggers the first fetching unit to download the next DOM node; otherwise, it sends a judgment notification to the fourth judging module;
the fourth judging module is configured to determine whether the currently downloaded DOM node contains a DOM event and a callback function; if not, it triggers the first fetching unit to download the next DOM node; if so, it sends an execution notification to the second path generating module;
the second path generating module is configured to, on receiving the execution notification, generate a state path from the DOM event contained in the currently downloaded DOM node and send the generated state path to the second path determining module;
the second path determining module is configured to, on receiving a state path, trigger the second fetching unit to obtain the target page corresponding to that state path and determine from the result whether new page content is produced or a page jump occurs; if so, it confirms the state path as a state path corresponding to the base page. In either case, it triggers the first fetching unit to download the next DOM node.
Determining whether a page jump occurs includes: if the URL of the obtained target page differs from the URL of the base page, a page jump is determined to have occurred.
Determining whether new page content is produced includes: comparing the obtained target page with the base page by sentence signature or by string comparison, and determining that new page content is produced if the comparison shows that the two pages have different content; or,
computing the similarity between the obtained target page and the base page, and determining that new page content is produced if the result shows that the two pages have different content.
Specifically, the location information of the DOM event includes: a DOM node identifier, the path (XPath) of the DOM node, and a DOM event identifier.
Further, the device also includes:
a storage unit, configured to store the state paths corresponding to the base page generated by the analysis unit and the snapshots of the target pages fetched by the second fetching unit, and to build and store an index of the target pages.
A search engine, comprising: the above device for acquiring a target page, a user interface unit and a search processing unit;
the user interface unit is configured to receive a search request from a browser and send the keywords contained in the search request to the search processing unit, and to return the search result sent by the search processing unit to the browser, so that the browser can use the state path selected by the user to obtain the corresponding target page;
the search processing unit is configured to match the keywords against the index of target pages stored in the storage unit of the device, and to send to the user interface unit a search result containing the state paths corresponding to the matched target pages.
Further, the search result also includes snapshot information of the matched target pages;
the user interface unit is further configured to send to the search processing unit the snapshot information, returned by the browser, of the target page selected by the user, and to return to the browser the target page snapshot sent by the search processing unit;
the search processing unit is further configured to obtain the corresponding target page snapshot from the storage unit according to the snapshot information of the target page selected by the user, and to send it to the user interface unit.
Further, the search engine also includes: a path parsing unit and a network interface unit;
the user interface unit is further configured to, after receiving the user-selected state path sent by the browser, send that state path to the path parsing unit;
the path parsing unit is configured to generate a target page request from the received state path;
the network interface unit is configured to send the target page request generated by the path parsing unit to the target page site.
A browser, comprising: a network-side interface unit, a path parsing unit and a user-side interface unit;
the network-side interface unit is configured to receive the search result containing state paths sent by the search engine of claim 19, and to send the target page request sent by the path parsing unit to the target page site;
the user-side interface unit is configured to display to the user the search result received by the network-side interface unit, and to send the state path selected by the user to the path parsing unit;
the path parsing unit is configured to generate a target page request from the state path selected by the user and send it to the network-side interface unit.
As the above technical solution shows, the invention analyzes the base page and its scripts and introduces the concept of a state path: for each base page it generates state paths that contain dynamic information, and the target page a state path points to contains the dynamic content of the page. A search engine can therefore fetch the dynamic content of a page when it later searches for target pages.
Brief Description of the Drawings
Fig. 1 is a flowchart of the main method provided by the invention;
Fig. 2 is a detailed method flowchart provided by Embodiment 1 of the invention;
Fig. 3 is a flowchart of generating state paths provided by Embodiment 2 of the invention;
Fig. 4 is a flowchart of generating state paths provided by Embodiment 3 of the invention;
Fig. 5 is a flowchart of a browser obtaining a target page provided by Embodiment 4 of the invention;
Fig. 6 is a flowchart of a browser obtaining a target page provided by Embodiment 5 of the invention;
Fig. 7 is a flowchart of a browser obtaining a target page snapshot provided by Embodiment 6 of the invention;
Fig. 8 is a schematic structural diagram of the device provided by the invention;
Fig. 9 is a schematic diagram of one structure of the analysis unit in Fig. 8;
Fig. 10 is a schematic diagram of another structure of the analysis unit in Fig. 8;
Fig. 11 is a schematic structural diagram of the search engine provided by the invention;
Fig. 12 is a schematic structural diagram of the browser provided by the invention.
Detailed Description
To make the purpose, technical solution and advantages of the invention clearer, the invention is described in detail below with reference to the drawings and specific embodiments.
The main method provided by the invention may be as shown in Fig. 1 and includes the following steps:
Step 101: Fetch the base page corresponding to the received URL, together with the scripts of that base page.
Step 102: Analyze the fetched base page and scripts and generate one or more state paths that correspond to the base page and contain dynamic information. Each state path contains: the URL of the base page, the location information of the Document Object Model (DOM) event that produces dynamic information in the base page, and the index of the callback function corresponding to that DOM event.
Step 103: Fetch target pages using the generated state paths.
The flow shown in Fig. 1 is performed by the search engine. The search engine additionally stores the generated state paths so that, after receiving a search request from a browser, it can return a search result containing state paths, which the browser uses to obtain the target page for the state path the user selects.
The above method is described in detail below through specific embodiments.
Embodiment 1
Fig. 2 is a detailed method flowchart provided by Embodiment 1 of the invention. As shown in Fig. 2, the method may specifically include the following steps:
Step 201: The search engine receives a URL.
The search engine can fetch URLs automatically and in batches in the background.
Step 202: Fetch the base page corresponding to the received URL, together with the scripts of that base page.
A base page and its scripts can be related in two ways. First, the script may be embedded in an HTML tag of the base page's source code. Second, an HTML tag of the base page's source code may contain a link that points to a script document; in that case the base page and the script document are two separate documents connected by a reference.
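The two relationships between a base page and its scripts can be told apart while parsing. The Python sketch below uses the standard library's `html.parser` and `urllib.parse.urljoin`; the class name `ScriptCollector` and the choice to treat any `script` tag with a `src` attribute as an external script are illustrative assumptions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ScriptCollector(HTMLParser):
    """Separates scripts embedded in the page from scripts the page links to."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.inline_scripts = []     # case 1: script text inside the page
        self.external_scripts = []   # case 2: absolute URLs of linked scripts
        self._in_inline = False

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve the link against the base page URL for a later fetch.
            self.external_scripts.append(urljoin(self.base_url, src))
        else:
            self._in_inline = True
            self.inline_scripts.append("")

    def handle_data(self, data):
        if self._in_inline:
            self.inline_scripts[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_inline = False
```

A crawler following step 202 would fetch each URL in `external_scripts` in addition to the base page, so that both kinds of script are available for the later analysis.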
Step 203: Analyze the DOM nodes downloaded from the fetched base page, determine whether the script corresponding to a DOM event in a DOM node produces dynamic information, generate from the analysis result one or more state paths that correspond to the base page and contain dynamic information, and obtain target pages using those state paths. Each state path contains: the URL of the base page, the location information of the DOM event that produces dynamic information in the base page, and the index of the callback function corresponding to that DOM event.
The scripting languages covered by the invention include, but are not limited to: JavaScript, VBScript, Perl and Python.
The location information of a DOM event may contain: a DOM node identifier, the path (XPath) of the DOM node, and a DOM event identifier. The DOM node identifier may be the ID or the name of the DOM node.
The callback function index in a state path is used to reference the callback function corresponding to the DOM event. Every callback function in a script has an index, and the correspondence between indexes and concrete callback functions can be stored in a data structure such as a global function table or a mapping function. Looking up the callback function index from the state path in that data structure yields the callback function corresponding to the DOM event. The callback functions here may be anonymous or non-anonymous.
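A global function table of this kind can be sketched as a plain dictionary in Python. The class name `CallbackTable` and the rule of giving anonymous callbacks generated indexes such as `anon1` are assumptions for illustration; the source only requires that every callback, anonymous or not, be reachable through an index.

```python
class CallbackTable:
    """Maps callback indexes to callables, including anonymous callbacks."""

    def __init__(self):
        self._table = {}
        self._anon_count = 0

    def register(self, func, name=None):
        """Stores a callback and returns the index a state path would carry."""
        if name is None:
            # Anonymous callbacks have no name, so an index is generated.
            self._anon_count += 1
            name = f"anon{self._anon_count}"
        self._table[name] = func
        return name

    def resolve(self, index):
        """Looks up the callback for the index found in a state path."""
        return self._table[index]
```

Storing an index instead of the function body keeps the state path small and lets it be serialized alongside the base page URL.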
For a given state path, the corresponding target page can be obtained once the callback function of the DOM event has been compiled and executed.
The implementation of this step is described in detail in Embodiment 2 and Embodiment 3.
A base page may correspond to N state paths and N target pages, where N is an integer greater than or equal to one.
For example, for the base page whose URL is www.baidu.com, two generated state paths could be:
{base_url:http://www.baidu.com,id:idsample1,xpath:html/body/a/,event:click,type:new_content,callback:fun1}
{base_url:http://www.baidu.com,id:idsample2,xpath:html/body/li/a/,event:click,type:new_link,callback:fun2}
Note that the invention does not restrict the concrete format of a state path; the above is only one example.
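As one way to work with the example format above, the Python sketch below parses such a string into a small record. The `StatePath` class, its field names `node_id` and `kind`, and the parsing rule (comma-separated fields split at the first colon) are assumptions that fit this one example and would break on values that contain commas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StatePath:
    base_url: str   # URL of the base page
    node_id: str    # DOM node identifier
    xpath: str      # path of the DOM node
    event: str      # DOM event identifier
    kind: str       # 'type' field of the example, e.g. new_content or new_link
    callback: str   # callback function index


def parse_state_path(text):
    """Parses the brace-delimited example format into a StatePath."""
    body = text.strip().strip("{}")
    fields = {}
    for part in body.split(","):
        # Split at the first colon only, so URLs keep their own colons.
        key, _, value = part.partition(":")
        fields[key.strip()] = value.strip()
    return StatePath(
        base_url=fields["base_url"],
        node_id=fields["id"],
        xpath=fields["xpath"],
        event=fields["event"],
        kind=fields["type"],
        callback=fields["callback"],
    )
```

Because the record is frozen it is hashable, which makes it convenient to keep state paths in sets when comparing a new crawl against stored results.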
Step 204: Store the state paths corresponding to the base page and the target page snapshots corresponding to those state paths, and build and store an index of the target pages, so that they can later be found by the search engine and returned to the browser as search results.
In this embodiment, the base page and its scripts fetched in step 202, the state paths generated in step 203 and the target page snapshots obtained in step 203 may all be stored. The stored base page data may specifically include the base page URL, the base page snapshot, and so on.
The process by which the search engine obtains target pages may run periodically or be triggered manually. Each time state paths are generated for a base page, if state paths for that base page have already been stored, the newly generated state paths can be compared with the stored ones; if they differ, the stored state paths for the base page are updated promptly.
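The compare-and-update rule for stored state paths can be sketched in a few lines of Python. The in-memory dictionary `store`, the function name `refresh_state_paths` and the use of an order-insensitive set comparison are assumptions; the source does not say how the stored paths are compared.

```python
def refresh_state_paths(store, base_url, new_paths):
    """Replaces the stored state paths of a base page only when they changed.

    store: dict mapping a base page URL to a frozenset of state paths.
    new_paths: iterable of hashable state paths from the latest crawl.
    Returns True when the store was updated, False when nothing changed.
    """
    new_set = frozenset(new_paths)
    if store.get(base_url) == new_set:
        return False          # identical to what is stored: keep as is
    store[base_url] = new_set
    return True
```

A caller could use the return value to decide whether the target page snapshots and index for that base page also need to be rebuilt.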
In addition, the search engine can periodically use the target page index to check whether a target page has been updated, and update the stored index promptly. Likewise, if a newly obtained target page snapshot differs from the stored one, the stored snapshot can be replaced with the new one.
The three kinds of stored content described above can be stored separately or together.
Steps 201 to 204 above are all background operations of the search engine. If the search engine receives a search request from a browser, it continues with the following step in the foreground.
Step 205: After receiving a search request from a browser, match the keywords contained in the search request against the index of each target page, and return to the browser a search result that contains the state paths corresponding to the matched target pages, so that the browser can use the state path selected by the user to obtain the corresponding target page.
When the search engine receives a search request containing keywords, the index of base pages takes part in the matching as well as the index of target pages; that is, base pages are also included in the search result. This part is the same as in the prior art and is not described further.
Further, the search result may also contain snapshot information of the target pages, or the index of the target pages.
For how the browser uses the user-selected state path to obtain the corresponding target page in this step, see Embodiment 4 and Embodiment 5.
The state paths in step 203 above may be generated in either of two ways, described in Embodiment 2 and Embodiment 3.
Embodiment 2
FIG. 3 is a flowchart of generating state paths according to Embodiment 2 of the present invention. As shown in FIG. 3, the process may include the following steps:
Step 301: Download each DOM node while crawling the base page and its scripts.
Step 302: Determine whether the download of DOM nodes has finished. If so, end the crawl of the base page and go to step 306; otherwise, perform step 303 on the currently downloaded DOM node.
Step 303: Determine whether the currently downloaded DOM node is a script tag. If so, go to step 302 for the next downloaded DOM node; otherwise, perform step 304.
For a script-tag node, the script corresponding to the script tag can be sent to the script parsing engine to be compiled and executed.
Step 304: Determine whether the DOM node contains a DOM event and a callback function. If not, leave the analysis of this DOM node and go to step 302 for the next downloaded DOM node; if so, perform step 305.
If the DOM node contains no DOM event and callback function, the node cannot cause a page jump or new page content, that is, it produces no dynamic page information, so it can be skipped. If there is a next DOM node, analysis of that node begins.
Step 305: Use the DOM event contained in the DOM node to generate a state path, and save the generated state path in the state path queue; then go to step 302 for the next downloaded DOM node.
Step 306: Obtain, one by one, the target page corresponding to each state path in the state path queue, determine whether new page content is produced or a page jump occurs, and determine the state paths that produce new page content or cause a page jump to be the state paths corresponding to the base page.
The state paths that produce new page content or cause a page jump can then be stored together with their corresponding target pages.
Whether a page jump occurs can be determined as follows: if the URLs of the target page and the base page differ, a page jump is determined to have occurred. Whether new page content is produced can be determined as follows: compare sentence signatures or character strings of the target page and the base page, or compute the similarity between the target page and the base page; if the comparison result or the similarity result shows that the two pages have different content, new page content is determined to have been produced. When comparing sentence signatures, the signatures can be computed with an existing method such as MD5; no specific limitation is imposed here.
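The two checks can be sketched in a few lines. The example below is one possible reading, under stated assumptions: sentences are split on common Western and Chinese sentence-ending punctuation, each sentence is hashed with MD5, and "new page content" is taken to mean that the target page contains at least one sentence signature absent from the base page. The function names are hypothetical.

```python
import hashlib
import re
from typing import Set


def sentence_signatures(text: str) -> Set[str]:
    """Split text into sentences and return the MD5 hex digest of each one."""
    sentences = [s.strip() for s in re.split(r"[.!?。！？\n]+", text) if s.strip()]
    return {hashlib.md5(s.encode("utf-8")).hexdigest() for s in sentences}


def page_jumped(base_url: str, target_url: str) -> bool:
    """A page jump is assumed whenever the two URLs differ."""
    return base_url != target_url


def has_new_content(base_text: str, target_text: str) -> bool:
    """True when the target page has a sentence that the base page lacks."""
    return bool(sentence_signatures(target_text) - sentence_signatures(base_text))
```

Comparing sets of signatures makes the check insensitive to sentence order, which is a design choice of this sketch rather than something the source prescribes.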
In Embodiment 2, all state paths generated from DOM events are first saved in the state path queue. Because a state path generated from a DOM event does not necessarily produce dynamic page information, some of these state paths may be invalid. Each state path in the queue is therefore checked one by one to determine whether its corresponding target page contains dynamic information. Steps 303 to 305 analyze each DOM node and preliminarily generate state paths; that is, steps 303 to 305 are performed on every downloaded DOM node, and once all DOM nodes have been downloaded, step 306 is performed to finally determine the state paths corresponding to the base page.
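The collect-then-verify flow of steps 301 to 306 can be sketched as below. The modelling choices are assumptions made for illustration: a DOM node is a `(tag, events)` pair, a state path is a tuple of base URL, node position, event name and callback index, and the check of step 306 is passed in as the hypothetical predicate `is_dynamic`.

```python
from collections import deque
from typing import Callable, Deque, Iterable, List, Tuple

# Assumed shapes: a node is (tag, [(event_name, callback_index), ...]).
Node = Tuple[str, List[Tuple[str, int]]]
# (base_url, node_position, event_name, callback_index)
StatePath = Tuple[str, int, str, int]


def collect_then_verify(base_url: str,
                        nodes: Iterable[Node],
                        is_dynamic: Callable[[StatePath], bool]) -> List[StatePath]:
    queue: Deque[StatePath] = deque()
    for position, (tag, events) in enumerate(nodes):  # steps 301-302
        if tag == "script":                           # step 303: skip script tags
            continue
        if not events:                                # step 304: no event, no callback
            continue
        for event, callback_index in events:          # step 305: queue a candidate
            queue.append((base_url, position, event, callback_index))
    # Step 306: verify every queued candidate once the download has finished.
    return [path for path in queue if is_dynamic(path)]
```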
Embodiment 3
FIG. 4 is a flowchart of generating state paths according to Embodiment 3 of the present invention. As shown in FIG. 4, the process may include the following steps:
Step 401: Download each DOM node while crawling the base page and its scripts.
Step 402: Determine whether the download of DOM nodes has finished. If so, end the crawl of the base page; otherwise, perform step 403 on the currently downloaded DOM node.
Step 403: Determine whether the currently downloaded DOM node is a script tag. If so, go to step 402 for the next downloaded DOM node; otherwise, perform step 404.
For a script-tag node, the script corresponding to the script tag can be sent to the script parsing engine to be compiled and executed.
Step 404: Determine whether the DOM node contains a DOM event and a callback function. If not, leave the analysis of this DOM node and go to step 402 for the next downloaded DOM node; if so, perform step 405.
Step 405: Use the DOM event in the DOM node to generate a state path.
In this step, state paths can be generated for all DOM events or, preferably, only for DOM events in a preset DOM event list. The preset DOM event list may include onclick, ondblclick, onmouseover, onmousemove, onmouseout, onblur, onfocus, onchange, onsubmit, onselect and so on, all of which are DOM events that may produce dynamic page information.
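Filtering a node's attributes against the preset list might look like the following. The set contents follow the list in the text, with the double-click handler written in its standard HTML spelling `ondblclick`; the function name `candidate_events` and the attribute dictionary are assumptions for the example.

```python
from typing import Dict, List

# Event attributes named in the text as possible sources of dynamic content.
PRESET_DOM_EVENTS = frozenset({
    "onclick", "ondblclick", "onmouseover", "onmousemove", "onmouseout",
    "onblur", "onfocus", "onchange", "onsubmit", "onselect",
})


def candidate_events(attributes: Dict[str, str]) -> List[str]:
    """Return the attribute names of a node that appear in the preset list.

    Matching is case-insensitive because HTML attribute names are.
    """
    return sorted(name for name in attributes if name.lower() in PRESET_DOM_EVENTS)
```

Restricting analysis to a preset list trades completeness for speed: handlers such as `onload` are ignored, so fewer candidate state paths need to be fetched and checked.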
Step 406: Obtain the target page corresponding to the state path and determine whether new page content is produced or a page jump occurs. If so, perform step 407; otherwise, go to step 402 for the next downloaded DOM node.
Step 407: Determine the state path to be a state path corresponding to the base page; the state path and its corresponding target page can be stored. Then go to step 402 for the next downloaded DOM node.
Unlike Embodiment 2, Embodiment 3 checks each state path as soon as it is generated to determine whether its corresponding target page contains dynamic information (step 406), and if so stores the state path and its corresponding target page. Steps 403 to 407 analyze each downloaded DOM node and generate state paths; that is, steps 403 to 407 are performed on every downloaded DOM node until all DOM nodes have been downloaded.
This concludes the process of Embodiment 3.
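For comparison with Embodiment 2, the verify-as-you-go flow of steps 401 to 407 can be sketched as follows. As before, the node and state path shapes are assumptions, and `fetch_target` and `is_dynamic` are hypothetical callables standing in for the second crawl and the new-content or page-jump check.

```python
from typing import Callable, Dict, Iterable, List, Tuple

# Assumed shapes: a node is (tag, [(event_name, callback_index), ...]).
Node = Tuple[str, List[Tuple[str, int]]]
# (base_url, node_position, event_name, callback_index)
StatePath = Tuple[str, int, str, int]


def verify_each_path(base_url: str,
                     nodes: Iterable[Node],
                     fetch_target: Callable[[StatePath], str],
                     is_dynamic: Callable[[str], bool]) -> Dict[StatePath, str]:
    kept: Dict[StatePath, str] = {}
    for position, (tag, events) in enumerate(nodes):  # steps 401-402
        if tag == "script":                           # step 403: skip script tags
            continue
        # Step 404: a node without events contributes nothing to the loop below.
        for event, callback_index in events:
            path = (base_url, position, event, callback_index)  # step 405
            target = fetch_target(path)               # step 406: fetch immediately
            if is_dynamic(target):
                kept[path] = target                   # step 407: store path and page
    return kept
```

No queue is needed here, at the cost of interleaving target page fetches with the download of the base page.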
In Embodiments 2 and 3, when the target page corresponding to a state path is obtained and it is determined whether new page content is produced or a page jump occurs, the callback function index corresponding to the DOM event is sent to the script parsing engine. The script parsing engine retrieves the corresponding callback function according to the index, and the target page is obtained and checked based on the result of compiling and executing that callback function. For an anonymous function, the script parsing engine compiles and executes the callback function in real time after retrieving it; for a non-anonymous function, the engine can reuse the earlier compilation and execution result of that callback function.
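The different handling of anonymous and named callbacks amounts to a cache keyed by callback index. The toy class below only illustrates that idea; it is not a JavaScript engine, and the class name, the `callbacks` table layout and the `executions` counter are all assumptions made for the example.

```python
from typing import Callable, Dict, Optional


class ScriptEngineSketch:
    """Toy stand-in for a script parsing engine that looks up callbacks by index."""

    def __init__(self, callbacks: Dict[int, dict]):
        # callbacks maps an index to {"name": Optional[str], "run": Callable}.
        self._callbacks = callbacks
        self._cache: Dict[int, object] = {}
        self.executions = 0  # counts real executions, for illustration only

    def result_for(self, callback_index: int):
        entry = self._callbacks[callback_index]
        named = entry["name"] is not None
        if named and callback_index in self._cache:
            return self._cache[callback_index]  # named: reuse the earlier result
        self.executions += 1
        result = entry["run"]()                 # anonymous: always run again
        if named:
            self._cache[callback_index] = result
        return result
```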
Depending on whether the browser can parse state paths, there are two ways for the browser to obtain a target page using a state path, described in Embodiment 4 and Embodiment 5 respectively.
Embodiment 4
When the browser can parse state paths, the corresponding flowchart is shown in FIG. 5 and includes the following steps:
Step 501: The browser sends a search request (query) containing keywords to the search engine.
Step 502: The search engine performs step 205 and returns search results containing state paths to the browser.
Step 503: The browser sends a target page request to the target page site according to the state path selected by the user.
When the user clicks the state path of a target page, the browser parses the clicked state path and sends a target page request to the target page site according to it.
Step 504: The target page site pushes the target page to the browser.
Embodiment 5
When the browser cannot parse state paths, the corresponding flowchart is shown in FIG. 6 and includes the following steps:
Step 601: The browser sends a search request containing keywords to the search engine.
Step 602: The search engine performs step 205 and returns search results containing state paths to the browser.
Step 603: The browser sends the state path selected by the user to the search engine.
Step 604: The search engine sends a target page request to the target page site according to the state path selected by the user.
Because the browser cannot parse state paths, it only sends the state path selected by the user to the search engine; the search engine parses the state path and sends a target page request to the target page site according to it.
Step 605: The target page site pushes the target page to the browser.
The target page request sent by the search engine contains browser information, so that the target page site pushes the target page to the browser.
This concludes the process of Embodiment 5.
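The difference between Embodiments 4 and 5 is only in who parses the state path. The sketch below lists the message hops for each case as `(sender, receiver, message)` tuples; the function name and the tuple representation are assumptions, while the hops themselves follow steps 501 to 504 and 601 to 605.

```python
from typing import List, Tuple

Hop = Tuple[str, str, str]  # (sender, receiver, message)


def message_sequence(browser_can_parse: bool) -> List[Hop]:
    """Return the message hops for Embodiment 4 (True) or Embodiment 5 (False)."""
    hops: List[Hop] = [
        ("browser", "search engine", "search request with keywords"),
        ("search engine", "browser", "search results with state paths"),
    ]
    if browser_can_parse:
        # Embodiment 4: the browser parses the selected state path itself.
        hops.append(("browser", "target site", "target page request"))
    else:
        # Embodiment 5: the search engine parses it on the browser's behalf.
        hops.append(("browser", "search engine", "selected state path"))
        hops.append(("search engine", "target site",
                     "target page request with browser information"))
    hops.append(("target site", "browser", "target page"))
    return hops
```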
In another case, if the search results returned by the search engine in step 205 of Embodiment 1 include target page snapshot information and the user clicks a target page snapshot, the interaction between the browser and the search engine can proceed as in Embodiment 6.
Embodiment 6
FIG. 7 is a flowchart of the browser obtaining a target page snapshot according to Embodiment 6. As shown in FIG. 7, the process may include the following steps:
Step 701: The browser sends a search request containing keywords to the search engine.
Step 702: The search engine performs step 205 and returns search results containing state paths and target page snapshot information to the browser.
Step 703: The browser sends the target page snapshot information selected by the user to the search engine.
Step 704: The search engine determines the corresponding target page snapshot and returns it to the browser.
Because the search engine has already stored each target page snapshot locally, it does not need to interact with the target page site; it retrieves the corresponding snapshot locally and returns it to the browser.
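The method provided by the present invention has been described in detail above. The device for acquiring a target page provided by the present invention is described in detail below. As shown in FIG. 8, the device may include a first crawling unit 800, an analysis unit 810 and a second crawling unit 820.
The first crawling unit 800 is configured to crawl the base page corresponding to a received URL and the scripts of that base page.
The analysis unit 810 is configured to analyze the base page and scripts crawled by the first crawling unit 800 and generate one or more state paths, containing dynamic information, corresponding to the base page. A state path contains the URL of the base page, the location information of the DOM event in the base page that produces the dynamic information, and the callback function index corresponding to the DOM event.
The second crawling unit 820 is configured to crawl target pages using the state paths generated by the analysis unit 810.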
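The analysis unit 810 can have either of two structures. The first structure, shown in FIG. 9, includes a first judging module 811, a second judging module 812, a first path generation module 813 and a first path determination module 814.
While crawling the base page and its scripts, the first crawling unit 800 downloads each DOM node and sends the currently downloaded DOM node to the first judging module 811; once all DOM nodes have been downloaded, it sends a determination notification to the first path determination module 814.
The first judging module 811 is configured to determine whether the currently downloaded DOM node is a script tag; if so, it triggers the first crawling unit 800 to download the next DOM node; otherwise, it sends a judgment notification to the second judging module 812.
The second judging module 812 is configured to determine whether the currently downloaded DOM node contains a DOM event and a callback function; if not, it triggers the first crawling unit 800 to download the next DOM node; if so, it sends an execution notification to the first path generation module 813.
The first path generation module 813 is configured to, after receiving the execution notification, use the currently downloaded DOM node to generate a state path, save the generated state path in the state path queue, and trigger the first crawling unit 800 to download the next DOM node.
The first path determination module 814 is configured to, on receiving the determination notification, trigger the second crawling unit 820 to obtain, one by one, the target page corresponding to each state path in the state path queue, determine from the results obtained by the second crawling unit 820 whether new page content is produced or a page jump occurs, and determine the state paths that produce new page content or cause a page jump to be the state paths corresponding to the base page.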
The second structure of the analysis unit 810, shown in FIG. 10, may include a third judging module 911, a fourth judging module 912, a second path generation module 913 and a second path determination module 914.
While crawling the base page and its scripts, the first crawling unit 800 downloads each DOM node and sends the currently downloaded DOM node to the third judging module 911, until all DOM nodes have been downloaded.
The third judging module 911 is configured to determine whether the currently downloaded DOM node is a script tag; if so, it triggers the first crawling unit 800 to download the next DOM node; otherwise, it sends a judgment notification to the fourth judging module 912.
The fourth judging module 912 is configured to determine whether the currently downloaded DOM node contains a DOM event and a callback function; if not, it triggers the first crawling unit 800 to download the next DOM node; if so, it sends an execution notification to the second path generation module 913.
The second path generation module 913 is configured to, on receiving the execution notification, use the DOM event contained in the currently downloaded DOM node to generate a state path and send the generated state path to the second path determination module 914.
The second path determination module 914 is configured to, on receiving a state path, trigger the second crawling unit 820 to obtain the target page corresponding to that state path and determine from the result obtained by the second crawling unit 820 whether new page content is produced or a page jump occurs. If so, it determines the state path to be a state path corresponding to the base page. In either case it then triggers the first crawling unit 800 to download the next DOM node.
Specifically, in both structures, determining whether a page jump occurs may include: if the URLs of the obtained target page and the base page differ, determining that a page jump has occurred.
Determining whether new page content is produced may include: comparing sentence signatures or character strings of the obtained target page and the base page, and if the comparison result shows that the two pages have different content, determining that new page content has been produced; or computing the similarity between the obtained target page and the base page, and if the result shows that the two pages have different content, determining that new page content has been produced.
The location information of the DOM event in the state path includes a DOM node identifier, the XPath of the DOM node, and a DOM event identifier.
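The fields of a state path listed in this description map naturally onto a small record type. The sketch below is one hypothetical encoding: the class and field names, and the use of JSON as the serialized form handed to the browser, are assumptions; only the set of fields (base page URL, DOM node identifier, XPath, DOM event identifier, callback function index) comes from the text.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class EventLocation:
    node_id: str   # DOM node identifier
    xpath: str     # XPath of the DOM node
    event_id: str  # DOM event identifier, e.g. "onclick"


@dataclass(frozen=True)
class StatePath:
    base_url: str
    location: EventLocation
    callback_index: int

    def to_json(self) -> str:
        """Serialize to JSON; the wire format is an assumption of this sketch."""
        return json.dumps(asdict(self), sort_keys=True)

    @staticmethod
    def from_json(text: str) -> "StatePath":
        raw = json.loads(text)
        return StatePath(raw["base_url"],
                         EventLocation(**raw["location"]),
                         raw["callback_index"])
```

Frozen dataclasses are hashable, so state paths built this way can be used directly as dictionary keys when storing target pages or snapshots.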
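Furthermore, the device may also include:
A storage unit 830, configured to store the state paths corresponding to the base page generated by the analysis unit 810 and the snapshots of the target pages crawled by the second crawling unit 820, and to build and store the indexes of the target pages.
In addition, the storage unit 830 also stores the base pages crawled by the first crawling unit 800. The base pages, the state paths and the target page snapshots can be stored separately or in a unified manner.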
FIG. 11 is a schematic structural diagram of the search engine provided by the present invention. As shown in FIG. 11, the search engine includes the device shown in FIG. 8, a user interface unit 1101 and a search processing unit 1102.
The user interface unit 1101 is configured to receive a search request from the browser and send the keywords contained in the search request to the search processing unit 1102, and to return the search results sent by the search processing unit 1102 to the browser, so that the browser can use the state path selected by the user to obtain the corresponding target page.
The search processing unit 1102 is configured to match the keywords against the target page indexes stored in the storage unit 830, and to send the state paths corresponding to the matched target pages to the user interface unit 1101 as part of the search results.
Preferably, the search results may also include snapshot information of the matched target pages. In this case:
The user interface unit 1101 is further configured to send the snapshot information of the target page selected by the user, as returned by the browser, to the search processing unit 1102, and to return the target page snapshot sent by the search processing unit 1102 to the browser.
The search processing unit 1102 is further configured to retrieve the corresponding target page snapshot from the storage unit 830 according to the snapshot information of the target page selected by the user, and send it to the user interface unit 1101.
Furthermore, when the browser cannot parse state paths, the search engine needs to provide this function in order to help deliver the target page to the browser. In this case, the search engine may further include a path parsing unit 1103 and a network interface unit 1104.
The user interface unit 1101 is further configured to, after receiving the user-selected state path sent by the browser, send that state path to the path parsing unit 1103.
The path parsing unit 1103 is configured to generate a target page request according to the received state path.
The network interface unit 1104 is configured to send the target page request generated by the path parsing unit 1103 to the target page site.
FIG. 12 is a schematic structural diagram of the browser provided by the present invention. This browser can parse state paths. As shown in FIG. 12, the browser may include a network-side interface unit 1201, a path parsing unit 1202 and a user-side interface unit 1203.
The network-side interface unit 1201 is configured to receive the search results containing state paths sent by the search engine shown in FIG. 11, and to send the target page request sent by the path parsing unit 1202 to the target page site.
The user-side interface unit 1203 is configured to display the search results received by the network-side interface unit 1201 to the user, and to send the state path selected by the user to the path parsing unit 1202.
The path parsing unit 1202 is configured to generate a target page request according to the state path selected by the user and send it to the network-side interface unit 1201.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.
Claims (22)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010105314609A CN101984429B (en) | 2010-11-04 | 2010-11-04 | Method and device for acquiring destination page, search engine and browser |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN2010105314609A CN101984429B (en) | 2010-11-04 | 2010-11-04 | Method and device for acquiring destination page, search engine and browser |
Publications (2)
Publication Number | Publication Date |
---|---|
CN101984429A CN101984429A (en) | 2011-03-09 |
CN101984429B true CN101984429B (en) | 2012-03-14 |
Family
ID=43641598
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN2010105314609A Active CN101984429B (en) | 2010-11-04 | 2010-11-04 | Method and device for acquiring destination page, search engine and browser |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101984429B (en) |
Families Citing this family (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103150307B (en) * | 2011-12-06 | 2016-02-10 | 株式会社理光 | The method and apparatus of the title relevant to descriptor is searched from network |
CN103268361B (en) * | 2013-06-07 | 2019-05-31 | 百度在线网络技术(北京)有限公司 | Extracting method, the device and system of URL are hidden in webpage |
CN103645968B (en) * | 2013-12-02 | 2017-03-15 | 北京奇虎科技有限公司 | A kind of browser status restored method and device |
CN103955495B (en) * | 2014-04-18 | 2018-05-04 | 百度在线网络技术(北京)有限公司 | The method for down loading and device of page child resource |
CN105740290A (en) * | 2014-12-11 | 2016-07-06 | 富士通株式会社 | System and method for searching self-adaptive networks of mobile devices |
CN104408198B (en) * | 2014-12-15 | 2018-07-17 | 北京国双科技有限公司 | The acquisition methods and device of Webpage content |
CN105867897A (en) * | 2015-12-07 | 2016-08-17 | 乐视网信息技术(北京)股份有限公司 | Page redirection analysis method and apparatus |
CN105718559B (en) * | 2016-01-20 | 2018-02-13 | 百度在线网络技术(北京)有限公司 | Search forms pages and the method and apparatus of target pages transforming relationship |
CN105740417A (en) * | 2016-01-29 | 2016-07-06 | 青岛海信移动通信技术股份有限公司 | Webpage based target data search method and module, browser and terminal |
US11080302B2 (en) * | 2016-05-31 | 2021-08-03 | Ebay Inc. | Identifying missing browse nodes |
CN107025111A (en) * | 2017-03-17 | 2017-08-08 | 烽火通信科技股份有限公司 | The method and system that a kind of browser target pages entire screen switch is shown |
CN107169011B (en) * | 2017-03-31 | 2021-06-11 | 百度在线网络技术(北京)有限公司 | Webpage originality identification method and device based on artificial intelligence and storage medium |
CN110874446A (en) * | 2018-08-31 | 2020-03-10 | 北京京东尚科信息技术有限公司 | Page display method and system, computer system and computer readable medium |
CN110674427B (en) * | 2019-09-20 | 2022-04-22 | 北京达佳互联信息技术有限公司 | Method, device, equipment and storage medium for responding to webpage access request |
CN111177539A (en) * | 2019-12-16 | 2020-05-19 | 北京百度网讯科技有限公司 | Search result page generation method and device, electronic equipment and storage medium |
JP7322194B2 (en) | 2020-04-29 | 2023-08-07 | バイドゥ オンライン ネットワーク テクノロジー(ペキン) カンパニー リミテッド | DATA UPDATE METHOD, DEVICE, SEARCH SERVER, TERMINAL AND STORAGE MEDIUM |
CN111767442B (en) * | 2020-04-29 | 2023-12-26 | 百度在线网络技术(北京)有限公司 | Data updating method, device, search server, terminal and storage medium |
WO2021226954A1 (en) * | 2020-05-14 | 2021-11-18 | 深圳市欢太科技有限公司 | Information crawling method and apparatus, and electronic device and storage medium |
CN113657076B (en) * | 2021-08-17 | 2023-08-22 | 中国平安财产保险股份有限公司 | Page operation record table generation method and device, electronic equipment and storage medium |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101587488A (en) * | 2009-05-25 | 2009-11-25 | 深圳市腾讯计算机系统有限公司 | Method and device for detecting re-orientation of page in search engine |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7720835B2 (en) * | 2006-05-05 | 2010-05-18 | Visible Technologies Llc | Systems and methods for consumer-generated media reputation management |
-
2010
- 2010-11-04 CN CN2010105314609A patent/CN101984429B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN101984429A (en) | 2011-03-09 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
ASS | Succession or assignment of patent right |
Owner name: BEIJING BAIDU NETWORK INFORMATION TECHNOLOGY CO., Free format text: FORMER OWNER: BAIDU ONLINE NETWORK TECHNOLOGY (BEIJING) CO., LTD. Effective date: 20111228 |
|
C41 | Transfer of patent application or patent right or utility model | ||
TA01 | Transfer of patent application right |
Effective date of registration: 20111228 Address after: 100085 Beijing, Haidian District, No. ten on the ground floor, No. 10 Baidu building, layer 2 Applicant after: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd. Address before: 100085 Beijing, Haidian District, No. ten on the street Baidu building, No. 10 Applicant before: BEIJING BAIDU NETCOM SCIENCE AND TECHNOLOGY Co.,Ltd. |
|
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |