
CN103631906A - Method and device for recognizing page number identification in webpage URL - Google Patents


Info

Publication number
CN103631906A
Authority
CN
China
Prior art keywords
url
page
characteristic
webpage
prefix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201310606990.9A
Other languages
Chinese (zh)
Inventor
王智广
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201310606990.9A priority Critical patent/CN103631906A/en
Publication of CN103631906A publication Critical patent/CN103631906A/en
Priority to PCT/CN2014/086522 priority patent/WO2015074455A1/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/955Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566URL specific, e.g. using aliases, detecting broken or misspelled links

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The present invention discloses a method and device for identifying the page number identifier in a webpage URL. The method includes: obtaining the associated URL linked to by a page-turning feature anchor among the page elements of a specified webpage; computing an associated-webpage URL pattern from the URL of the specified webpage and the associated URL; determining, based on the associated-webpage URL pattern corresponding to the specified webpage, the page number feature part of the specified webpage URL and the page number feature part of the associated URL; and comparing the page number feature parts of the specified webpage URL and the associated URL and extracting the differing numeric part as the page number identifier of the specified webpage URL. Because the associated-webpage URL pattern is computed from the URL of the specified webpage and the associated URL, computation is efficient, and because the common parts of the URLs are compared, the recall rate is greatly improved.

Description

Method and device for identifying the page number identifier in a webpage URL
Technical field
The present invention relates to the field of web data processing, and in particular to a method for identifying the page number identifier in a webpage URL and a device for identifying the page number identifier in a webpage URL.
Background
With the development of the Internet, more and more information is presented on the Internet as webpages for users to query, and querying data through Internet search engines has become the most commonly used way of searching for data.
When a search engine handles webpages, it needs to apply different scheduling strategies to different kinds of webpages. Identifying the kind of webpage is a basic task, and identifying page-turning (paginated) webpages is a particularly important part of it. A page-turning webpage is one from which the previous page, the next page, or any other existing page of a paginated document can be viewed. Page turning changes the content shown in a physical book or a mobile web form so that different content can be viewed. When used on the Internet, this mechanism also presents user interface elements that can be used to browse to other pages.
The existing method of identifying page-turning webpages decides whether a webpage is an index page from the keywords contained in its URL (Uniform Resource Locator). For example, when a URL contains a keyword such as page, pn or p and the keyword is followed by a number, the webpage corresponding to that URL is judged to be a page-turning webpage.
However, this method has a low recall rate. The URLs of paginated pages on many websites do not contain these keywords, for example "http://cq.ABC.com/lvshi/o12/", "http://bbs.BCA.com/t661_10" and "http://china.BCD.com/product/20110617/2647", yet these webpages are still page-turning webpages. Such methods are therefore prone to misjudgment and are of limited practical use.
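The recall problem can be illustrated with a minimal Python sketch of the keyword rule described above. The regular expression and the function name looks_paginated_by_keyword are assumptions made for illustration; the text does not specify how the existing method is implemented.

```python
import re

# Assumed form of the keyword rule: "page", "pn" or "p", an optional
# separator, then digits. The keyword must not be preceded by a letter.
KEYWORD_PAGE = re.compile(r"(?<![a-z])(?:page|pn|p)[=_/-]?[0-9]+", re.IGNORECASE)

def looks_paginated_by_keyword(url):
    """Return True if the URL contains a paging keyword followed by a number."""
    return KEYWORD_PAGE.search(url) is not None

# Paginated URLs quoted in the text that carry no paging keyword.
missed = [
    "http://cq.ABC.com/lvshi/o12/",
    "http://bbs.BCA.com/t661_10",
    "http://china.BCD.com/product/20110617/2647",
]
results = [looks_paginated_by_keyword(u) for u in missed]
```

Under this rule all three example URLs are rejected, although they are paginated, while a URL such as forum.php?page=3 is accepted.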
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method for identifying the page number identifier in a webpage URL, and a corresponding device, that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for identifying the page number identifier in a webpage URL is provided, comprising:
obtaining the associated URL linked to by a page-turning feature anchor among the page elements of a specified webpage;
computing an associated-webpage URL pattern from the URL of the specified webpage and the associated URL;
determining, based on the associated-webpage URL pattern corresponding to the specified webpage, the page number feature part of the specified webpage URL and the page number feature part of the associated URL;
comparing the page number feature parts of the specified webpage URL and the associated URL, and extracting the differing numeric part as the page number identifier of the specified webpage URL.
Optionally, the step of obtaining the associated URL linked to by a page-turning feature anchor among the page elements of the specified webpage comprises:
matching page-turning feature anchors against the DOM tree nodes of the specified webpage;
when a match succeeds, obtaining the associated URL from the matched page-turning feature anchor.
Optionally, the page-turning feature anchor links to one or more associated URLs.
Optionally, the step of computing the associated-webpage URL pattern from the URL of the specified webpage and the associated URL comprises:
replacing the numeric blocks in the URL of the specified webpage with a wildcard to obtain a first feature URL prefix, wherein a numeric block is a single digit or a run of digits delimited by separators;
replacing the numeric blocks in the associated URL with a wildcard to obtain a second feature URL prefix;
when the first feature URL prefix is identical to the second feature URL prefix, taking the first feature URL prefix or the second feature URL prefix as the associated-webpage URL pattern.
Optionally, the step of replacing the numeric blocks in the URL of the specified webpage with a wildcard to obtain the first feature URL prefix is:
replacing the numeric blocks at different positions in the URL of the specified webpage with the same wildcard to obtain the first feature URL prefix;
and the step of replacing the numeric blocks in the associated URL with a wildcard to obtain the second feature URL prefix is:
replacing the numeric blocks at different positions in the associated URL with the same wildcard to obtain the second feature URL prefix.
Optionally, the step of replacing the numeric blocks in the URL of the specified webpage with a wildcard to obtain the first feature URL prefix is:
replacing the numeric blocks at different positions in the URL of the specified webpage with different wildcards to obtain the first feature URL prefix;
and the step of replacing the numeric blocks in the associated URL with a wildcard to obtain the second feature URL prefix is:
replacing each numeric block in the associated URL with the wildcard used at the same position in the first feature URL prefix to obtain the second feature URL prefix.
Optionally, the page number identifier includes a first-page identifier, and the first-page identifier includes 0, 1 and/or the largest value among the current associated webpages.
According to another aspect of the present invention, a device for identifying the page number identifier in a webpage URL is provided, comprising:
an associated URL acquisition module, adapted to obtain the associated URL linked to by a page-turning feature anchor among the page elements of a specified webpage;
an associated-webpage URL pattern computing module, adapted to compute an associated-webpage URL pattern from the URL of the specified webpage and the associated URL;
a page number feature part determination module, adapted to determine, based on the associated-webpage URL pattern corresponding to the specified webpage, the page number feature part of the specified webpage URL and the page number feature part of the associated URL;
a page number identifier determination module, adapted to compare the page number feature parts of the specified webpage URL and the associated URL, and to extract the differing numeric part as the page number identifier of the specified webpage URL.
Optionally, the associated URL acquisition module is further adapted to:
match page-turning feature anchors against the DOM tree nodes of the specified webpage;
when a match succeeds, obtain the associated URL from the matched page-turning feature anchor.
Optionally, the page-turning feature anchor links to one or more associated URLs.
Optionally, the associated-webpage URL pattern computing module comprises:
a first feature URL prefix obtaining submodule, adapted to replace the numeric blocks in the URL of the specified webpage with a wildcard to obtain a first feature URL prefix, wherein a numeric block is a single digit or a run of digits delimited by separators;
a second feature URL prefix obtaining submodule, adapted to replace the numeric blocks in the associated URL with a wildcard to obtain a second feature URL prefix;
an associated-webpage URL pattern obtaining submodule, adapted to take the first feature URL prefix or the second feature URL prefix as the associated-webpage URL pattern when the first feature URL prefix is identical to the second feature URL prefix.
Optionally, the first feature URL prefix obtaining submodule is further adapted to:
replace the numeric blocks at different positions in the URL of the specified webpage with the same wildcard to obtain the first feature URL prefix;
and the second feature URL prefix obtaining submodule is further adapted to:
replace the numeric blocks at different positions in the associated URL with the same wildcard to obtain the second feature URL prefix.
Optionally, the first feature URL prefix obtaining submodule is further adapted to:
replace the numeric blocks at different positions in the URL of the specified webpage with different wildcards to obtain the first feature URL prefix;
and the second feature URL prefix obtaining submodule is further adapted to:
replace each numeric block in the associated URL with the wildcard used at the same position in the first feature URL prefix to obtain the second feature URL prefix.
Optionally, the page number identifier includes a first-page identifier, and the first-page identifier includes 0, 1 and/or the largest value among the current associated webpages.
The present invention uses page-turning feature anchors to identify associated webpages, which gives high recognition accuracy. The associated-webpage URL pattern is computed from the URL of the specified webpage and the associated URL, which is computationally efficient, and the common parts of the URLs are compared, which greatly improves the recall rate; in practical applications more than 90% of associated webpages can be identified.
The present invention replaces numeric blocks with wildcards to obtain the first feature URL prefix and the second feature URL prefix and, when the two prefixes are identical, takes either of them as the associated-webpage URL pattern. Because matching is performed on the common parts of the URLs, the recognition accuracy for associated webpages is further improved.
The present invention replaces the page-turning block of the associated-webpage URL pattern with a first-page identifier to obtain the URL of the first associated webpage. Likewise, the page-turning block can be replaced with other page identifiers to obtain the URLs of other associated webpages. This increases the coverage of associated webpages, so that a more complete set of associated webpages can be obtained and finer-grained operations become possible.
The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the present invention become more apparent, specific embodiments of the present invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be regarded as limiting the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flow chart of the steps of a method embodiment for identifying the page number identifier in a webpage URL according to an embodiment of the present invention;
Fig. 2 shows an example of a webpage structure according to an embodiment of the present invention;
Fig. 3 shows an example of a page-turning block according to an embodiment of the present invention; and
Fig. 4 shows a structural block diagram of a device embodiment for identifying the page number identifier in a webpage URL according to an embodiment of the present invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure can be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope can be fully conveyed to those skilled in the art.
Referring to Fig. 1, a flow chart of the steps of a method embodiment for identifying the page number identifier in a webpage URL according to an embodiment of the present invention is shown. The method may specifically include the following steps:
Step 101: obtain the associated URL linked to by a page-turning feature anchor among the page elements of a specified webpage.
A webpage can be divided into several regions by function. Taking a page of a forum (Bulletin Board System, BBS) as an example, as shown in Fig. 2, the page can be divided into a navigation block (1), junk blocks (2, 4), a page-turning block (3), a title block (5), an author information block (6), a publication date block (7) and a body text block (8). The navigation block may be located above the page header or below the banner, and points to the information columns of the webpage. A junk block is a region containing page elements with very low relevance to the topic of the webpage, such as the "post" and "reply" function buttons. The page-turning block is the region that indicates page turning. The title block is the region containing the title of the webpage topic (for example, "secure browser assemble black Thursday" shown in Fig. 2). The author information block is the region recording the author information of the webpage topic. The body text block is the region recording the body text of the webpage topic.
Referring to Fig. 3, an example of a page-turning block according to an embodiment of the present invention is shown.
As shown in Fig. 3, a page-turning block mainly consists of page-turning feature anchors. A page-turning feature anchor is a page-turning feature string that can be used to identify page elements for page turning.
In a specific implementation, the page-turning feature anchors may include one or more of the following:
[<<], [>>], [], [], [<<], [>>], [>], [<], [next page], [previous page], [previous], [next], [next], [last page], [end page], [front page], [back page], [< previous page], [< previous], [next >], [next page >], [1...].
Of course, the above page-turning feature anchors are only examples. When implementing the embodiment of the present invention, other page-turning feature anchors may be set according to the actual situation; the embodiment of the present invention places no limitation on this.
In a preferred embodiment of the present invention, step 101 may specifically include the following sub-steps:
Sub-step S11: match page-turning feature anchors against the DOM tree nodes of the specified webpage;
Sub-step S12: when a match succeeds, obtain the associated URL from the matched page-turning feature anchor.
The DOM (Document Object Model) is the standard programming interface for processing extensible markup languages. The DOM allows the content and structure of a document to be accessed and modified in a platform- and language-independent way, and is a common way of representing and processing an HTML or XML document.
The DOM is in effect a document model described in an object-oriented way. It defines the objects needed to represent and modify a document, the behaviours and attributes of those objects, and the relationships between them. The DOM can be thought of as a tree representation of the data and structure of a page, although the page need not actually be implemented as such a tree.
With JavaScript, an entire HTML document can be restructured: items on the page can be added, removed, changed or rearranged.
To change something on the page, JavaScript needs access to all the elements in the HTML document. This access, together with the methods and attributes for adding, moving, changing or removing HTML elements, is provided by the Document Object Model (DOM).
An HTML document can be viewed as a tree structure, called the node tree (HTML DOM). Through the HTML DOM, all nodes in the tree can be accessed with JavaScript. All HTML elements (nodes) can be modified, and nodes can be created or deleted.
The nodes in the node tree have hierarchical relationships with one another, described with terms such as parent, child and sibling. A parent node has child nodes. Child nodes at the same level are called siblings. The top node of the node tree is called the root. Every node has a parent node except the root. A node can have any number of children, and siblings are nodes that share the same parent.
The webpage elements to be operated on can be found in the node tree in several ways:
For example, by using the getElementById() and getElementsByTagName() methods.
As another example, by using the parentNode, firstChild and lastChild attributes of an element node.
The getElementById() and getElementsByTagName() methods can find any HTML element in the whole HTML document, regardless of the document structure. To find all <p> elements in a document, getElementsByTagName() finds them all, whatever level of the document they are at. Likewise, getElementById() returns the correct element wherever it is hidden in the document structure. These two methods provide any HTML element needed, whatever its position in the document.
In addition, getElementById() returns the webpage element with the specified ID.
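The tree navigation described above can be tried outside a browser. The following is a small sketch using Python's standard xml.dom.minidom module, which exposes the same getElementsByTagName(), parentNode, firstChild and lastChild names; the markup string is a hypothetical, well-formed fragment modelled on a page-turning block, not taken from a real page.

```python
from xml.dom.minidom import parseString

# Hypothetical well-formed fragment; no whitespace between tags, so that
# firstChild and lastChild are element nodes rather than text nodes.
doc = parseString(
    '<div id="pg">'
    '<a href="forum-99-1.html">1</a>'
    '<a href="forum-99-3.html">3</a>'
    '</div>'
)

# Find every <a> element regardless of its depth in the tree.
links = doc.getElementsByTagName("a")
hrefs = [a.getAttribute("href") for a in links]

# Navigate by relationship: root, first child, last child, parent.
root = doc.documentElement
first_href = root.firstChild.getAttribute("href")
last_href = root.lastChild.getAttribute("href")
parent_id = root.lastChild.parentNode.getAttribute("id")
```

Real webpages are rarely well-formed XML, so a tolerant HTML parser would normally be used to build the tree; minidom is used here only because it ships with the standard library.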
In a specific implementation, the hyperlink <a> (anchor) tags in the DOM tree of the webpage's HTML text can be checked for whether they contain one or more of [<<], [>>], [], [], [<<], [>>], [>], [<], [next page], [previous page], [previous], [next], [next], [last page], [end page], [front page], [back page], [< previous page], [< previous], [next >], [next page >], [1...]. If so, the current webpage is judged to have page-turning feature anchors.
The <a> tag is used to link text or an image at the current location to another page, text, image and so on.
The basic syntax of the <a> tag may be as follows:
<a
class=type
id=value
href=reference
name=value
rel=same|next|parent|previous
rev=value
target=window
style=value
title=title
onclick=function
onmouseout=function
>code of the text or image to display</a>
For example, the <a> tags in the following HTML text are:
<div id="pgt" class="bm bw0 pgs cl">
  <span id="fd_page_top">
    <div class="pg">
      <a href="forum-99-1.html" class="prev"></a>
      <a href="forum-99-1.html">1</a>
      <strong>2</strong>
      <a href="forum-99-3.html">3</a>
      <a href="forum-99-4.html">4</a>
      <a href="forum-99-5.html">5</a>
      <a href="forum-99-6.html">6</a>
      <a href="forum-99-7.html">7</a>
      <a href="forum-99-8.html">8</a>
      <a href="forum-99-9.html">9</a>
      <a href="forum-99-10.html">10</a>
      <a href="forum-99-1000.html" class="last">...2107</a>
      <label>
        <input type="text" name="custompage" class="px" size="2" title="Enter the page number and press Enter to jump" value="2" onkeydown="if (event.keyCode==13) {window.location='forum.php?mod=forumdisplay&fid=99&page='+this.value; doane(event);}" />
        <span title="1000 pages in total"> / 1000 pages</span>
      </label>
      <a href="forum-99-3.html" class="nxt">Next page</a>
    </div>
  </span>
</div>
By matching the <a> tags in the HTML text, it can be determined that the webpage has one or more page-turning feature anchors.
In practical applications, a page-turning feature anchor may link to one or more associated URLs.
Specifically, after the one or more page-turning feature anchors are identified, the one or more associated URLs they link to are extracted. These associated URLs point to other page-turning webpages associated with the current webpage.
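Sub-steps S11 and S12 can be sketched in Python with the standard html.parser module. This is only an illustration under stated assumptions: PAGE_TURN_ANCHORS is a small assumed subset of anchor strings, matching is done on the link text alone, and the names PageTurnLinkParser and find_associated_urls are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed subset of page-turning feature anchors (lower-cased link text).
PAGE_TURN_ANCHORS = {"<<", ">>", "<", ">", "next", "next page",
                     "previous page", "last page"}

class PageTurnLinkParser(HTMLParser):
    """Collect the href of every <a> whose text is a page-turning anchor."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self._href = None      # href of the <a> currently open, if any
        self._text = []        # text fragments seen inside that <a>
        self.associated_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip().lower()
            if text in PAGE_TURN_ANCHORS:
                # Resolve relative links against the specified webpage URL.
                self.associated_urls.append(urljoin(self.base_url, self._href))
            self._href = None

def find_associated_urls(html, base_url):
    parser = PageTurnLinkParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.associated_urls

sample = ('<div class="pg"><a href="forum-99-1.html">1</a><strong>2</strong>'
          '<a href="forum-99-3.html">3</a>'
          '<a href="forum-99-3.html" class="nxt">Next page</a></div>')
found = find_associated_urls(sample, "http://bbs.XXX.com/forum-99-2.html")
```

A fuller version might also match on attributes such as class="nxt" or class="prev", since some anchors in the sample markup carry no visible text.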
Step 102: compute the associated-webpage URL pattern from the URL of the specified webpage and the associated URL.
An associated-webpage URL pattern may be a set formed by grouping URLs or webpages that are similar in appearance or function.
In a preferred embodiment of the present invention, step 102 may specifically include the following sub-steps:
Sub-step S21: replace the numeric blocks in the URL of the specified webpage with a wildcard to obtain a first feature URL prefix, wherein a numeric block is a single digit or a run of digits delimited by separators;
Sub-step S31: replace the numeric blocks in the associated URL with a wildcard to obtain a second feature URL prefix.
It should be noted that the wildcard can be any character; the embodiment of the present invention places no limitation on this. A separator is a symbol used as a delimiter in a URL, for example "/", ".", "-", "?" or ":". A numeric block must consist of consecutive digits between separators; for example, "123ABC" is not a numeric block.
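The definition of a numeric (digital) block can be made concrete with a short Python sketch. The separator set below contains only the symbols named in the text; the function name numeric_blocks is hypothetical, and a real implementation might add further separators such as "_", "=" or "&".

```python
import re

# Separators named in the text: "/", ".", "-", "?" and ":".
SEPARATORS = r"[/.\-?:]"

def numeric_blocks(url):
    """Return the tokens between separators that consist only of digits."""
    tokens = re.split(SEPARATORS, url)
    return [t for t in tokens if re.fullmatch(r"[0-9]+", t)]

blocks = numeric_blocks("http://bbs.XXX.com/forum-99-2.html")
mixed = numeric_blocks("http://bbs.XXX.com/123ABC/7")
```

In the second call, "123ABC" is skipped because it mixes digits and letters, so only "7" is reported.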
In a preferred example of the embodiment of the present invention, sub-step S21 may further include the following sub-step:
Sub-step S211: replace the numeric blocks at different positions in the URL of the specified webpage with the same wildcard to obtain the first feature URL prefix.
Corresponding to sub-step S211, sub-step S31 may further include the following sub-step:
Sub-step S311: replace the numeric blocks at different positions in the associated URL with the same wildcard to obtain the second feature URL prefix.
In a specific implementation, the URL of the specified webpage and the associated URL may each have one or more numeric blocks. To reduce the number of replacement operations and the system resources used, the numeric blocks can all be replaced with the same wildcard.
For example, suppose the URL of the specified webpage is http://bbs.XXX.com/forum-99-2.html and the associated URL is http://bbs.XXX.com/forum-99-3.html, where "99" and "2" are identified as numeric blocks. Using "(d+)" as an example wildcard, the first feature URL prefix is http://bbs.XXX.com/forum-(d+)-(d+).html and the second feature URL prefix is http://bbs.XXX.com/forum-(d+)-(d+).html.
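The example above can be reproduced with a minimal Python sketch of sub-steps S211 and S311. The regular expression is one assumed way of locating digit runs bounded by the separators "/", ".", "-", "?" and ":" (or by the ends of the string); feature_prefix is a hypothetical name.

```python
import re

# A run of digits whose neighbours on both sides are separators
# (or the start/end of the string).
NUMERIC_BLOCK = re.compile(r"(?<![^/.\-?:])[0-9]+(?![^/.\-?:])")

def feature_prefix(url, wildcard="(d+)"):
    """Replace every numeric block in the URL with the same wildcard."""
    return NUMERIC_BLOCK.sub(lambda match: wildcard, url)

first = feature_prefix("http://bbs.XXX.com/forum-99-2.html")
second = feature_prefix("http://bbs.XXX.com/forum-99-3.html")
```

Both calls yield http://bbs.XXX.com/forum-(d+)-(d+).html, matching the feature URL prefixes given in the text.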
In an embodiment of the present invention, sub-step S21 may further include the following sub-step:
Sub-step S212: replace the numeric blocks at different positions in the URL of the specified webpage with different wildcards to obtain the first feature URL prefix.
Corresponding to sub-step S212, sub-step S31 may further include the following sub-step:
Sub-step S312: replace each numeric block in the associated URL with the wildcard used at the same position in the first feature URL prefix to obtain the second feature URL prefix.
In a specific implementation, the URL of the specified webpage and the associated URL may each have one or more numeric blocks. To make the subsequent comparison of the first and second feature URL prefixes, and the identification of individual numeric blocks, more efficient, different wildcards can be used to replace the numeric blocks.
For example, suppose the URL of the specified webpage is http://bbs.XXX.com/forum-99-2.html and the associated URL is http://bbs.XXX.com/forum-99-3.html, where "99" and "2" are identified as numeric blocks. Using "(d+)" and "(e+)" as example wildcards, the first feature URL prefix is http://bbs.XXX.com/forum-(d+)-(e+).html and the second feature URL prefix is http://bbs.XXX.com/forum-(d+)-(e+).html.
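Position-specific wildcards can be sketched in Python as follows. The scheme of deriving "(d+)", "(e+)", "(f+)" and so on from the block position is an assumption made to reproduce the example; the text only requires that blocks at the same position receive the same wildcard in both URLs. The name indexed_feature_prefix is hypothetical.

```python
import re
from itertools import count

# A run of digits bounded by the separators "/", ".", "-", "?", ":"
# or by the ends of the string.
NUMERIC_BLOCK = re.compile(r"(?<![^/.\-?:])[0-9]+(?![^/.\-?:])")

def indexed_feature_prefix(url):
    """Replace the n-th numeric block with the n-th wildcard: (d+), (e+), ..."""
    position = count()

    def wildcard(match):
        # Letters start at "d"; this simple scheme only suits a few blocks.
        return "(%s+)" % chr(ord("d") + next(position))

    return NUMERIC_BLOCK.sub(wildcard, url)

first = indexed_feature_prefix("http://bbs.XXX.com/forum-99-2.html")
second = indexed_feature_prefix("http://bbs.XXX.com/forum-99-3.html")
```

Because each wildcard names a position, the page-turning block can later be referred to directly, for example as "(e+)".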
Sub-step S41: when the first feature URL prefix is identical to the second feature URL prefix, take the first feature URL prefix or the second feature URL prefix as the associated-webpage URL pattern.
In practical applications, when the first feature URL prefix is identical to the second feature URL prefix, the specified webpage and the webpage corresponding to the associated URL can be judged to be associated page-turning webpages.
Because the first feature URL prefix is identical to the second feature URL prefix, either can be used as the associated-webpage URL pattern.
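Sub-step S41 amounts to an equality check on the two prefixes. A compact Python sketch, assuming the same-wildcard variant and the hypothetical name associated_url_pattern, with None standing for "no shared pattern":

```python
import re

# A run of digits bounded by "/", ".", "-", "?", ":" or the string ends.
NUMERIC_BLOCK = re.compile(r"(?<![^/.\-?:])[0-9]+(?![^/.\-?:])")

def associated_url_pattern(page_url, associated_url, wildcard="(d+)"):
    """Return the shared feature URL prefix, or None if the prefixes differ."""
    first = NUMERIC_BLOCK.sub(lambda m: wildcard, page_url)
    second = NUMERIC_BLOCK.sub(lambda m: wildcard, associated_url)
    return first if first == second else None

same = associated_url_pattern("http://bbs.XXX.com/forum-99-2.html",
                              "http://bbs.XXX.com/forum-99-3.html")
different = associated_url_pattern("http://bbs.XXX.com/forum-99-2.html",
                                   "http://bbs.XXX.com/thread-5-1-1.html")
```

The second pair yields None because the link leads to a page with a different URL structure, so it is not treated as an associated page-turning webpage.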
Step 103: based on the associated-webpage URL pattern corresponding to the specified webpage, determine the page number feature part of the specified webpage URL and the page number feature part of the associated URL.
In practical applications, a URL may include one or more of the following components:
1. protocol: specifies the transport protocol used. The most common is the HTTP protocol, which is also the most widely used protocol on the World Wide Web. Transport protocols include the file protocol (the resource is a file on the local computer; format file:///), the ftp protocol (the resource is accessed via FTP; format ftp://), gopher (the resource is accessed via the Gopher protocol), the http protocol (the resource is accessed via HTTP; format http://), the https protocol (the resource is accessed via secure HTTPS; format https://), and so on.
2. hostname: the Domain Name System (DNS) host name or IP address of the server that stores the resource. The host name is sometimes preceded by the user name and password needed to connect to the server (format username:password).
3. port: optional. When omitted, the default port of the scheme is used; each transport protocol has a default port number, for example 80 for http. Sometimes, for security or other reasons, the port is redefined on the server to a non-standard port number, in which case the port number cannot be omitted from the URL.
4. path: a string of zero or more segments separated by "/" symbols, generally used to represent a directory or file address on the host.
5. parameters: options used to specify special parameters.
6. query: used to pass parameters to dynamic webpages (such as webpages built with CGI, ISAPI, PHP/JSP/ASP/ASP.NET and similar technologies). There can be several parameters, separated by "&" symbols, with the name and value of each parameter separated by an "=" symbol.
7. fragment: specifies a fragment within the network resource. For example, if a webpage contains several glossary entries, a fragment can be used to jump directly to a particular entry.
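The seven components listed above map directly onto the fields returned by Python's standard urllib.parse.urlparse. The URL below is a hypothetical one constructed so that every component is present.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL that exercises every component in the list above.
url = "http://user:secret@bbs.example.com:8080/forum/list.php;v=1?fid=99&page=2#top"
parts = urlparse(url)

components = {
    "protocol": parts.scheme,
    "username": parts.username,
    "password": parts.password,
    "hostname": parts.hostname,
    "port": parts.port,
    "path": parts.path,
    "parameters": parts.params,
    "query": parts.query,
    "fragment": parts.fragment,
}

# The query string splits into name/value pairs on "&" and "=".
query_values = parse_qs(parts.query)
```

A page number can appear in the path (forum-99-2.html) or in the query (page=2), which is why the method works on numeric blocks across the whole URL rather than on a fixed component.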
In a specific implementation, structural analysis of the common part of multiple associated web page URL patterns extracts the page-turning block from the pattern; the page-turning block is then replaced with a first-page identifier to obtain the URL of the first associated web page.
Structural analysis of the common part of the associated web page URL pattern determines the page-number feature part of the pattern, that is, the page-turning block. Specifically, this may be the digit block that occupies the same position in multiple associated web page URLs but holds different numbers.
Step 104: compare the page-number feature parts of the specified web page URL and the associated page URL, and extract the differing digit part as the page number identifier of the specified web page URL.
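The comparison in step 104 can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes a digit block is any maximal run of digits, and it assumes that exactly one block differs between the two URLs. The function name `page_number_identifier` is hypothetical.

```python
import re

DIGIT_BLOCK = re.compile(r"\d+")


def page_number_identifier(page_url, associated_url):
    """Return (block_index, value_in_page_url, value_in_associated_url)
    for the single digit block that differs, or None if there is no such block."""
    # Both URLs must reduce to the same pattern once digit blocks are masked.
    if DIGIT_BLOCK.sub("*", page_url) != DIGIT_BLOCK.sub("*", associated_url):
        return None
    page_blocks = DIGIT_BLOCK.findall(page_url)
    assoc_blocks = DIGIT_BLOCK.findall(associated_url)
    differing = [
        (index, ours, theirs)
        for index, (ours, theirs) in enumerate(zip(page_blocks, assoc_blocks))
        if ours != theirs
    ]
    # Assumption: a page-turning link changes exactly one digit block.
    if len(differing) != 1:
        return None
    return differing[0]


print(page_number_identifier(
    "http://bbs.XXX.com/forum-99-3.html",
    "http://bbs.XXX.com/forum-99-4.html",
))
```

For the two forum URLs the second digit block (index 1) is reported, with value "3" in the specified page and "4" in the associated page.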
In a specific implementation, the page number identifier may include a first-page identifier, and the first-page identifier may include 0, 1, and/or the maximum value among the current associated web pages.
After the page-turning block has been extracted from the associated web page URL pattern, it can be replaced with the first-page identifier to obtain the URL of the first associated web page.
For example, for the associated web page URL pattern http://bbs.XXX.com/forum-(d+)-(e+).html from the example above, (e+) is identified as the page-turning block. Replacing the page-turning block with the first-page identifier yields the URL of the first associated web page, http://bbs.XXX.com/forum-99-1.html.
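The substitution in this example amounts to filling the wildcards of the pattern with concrete values. A minimal sketch, assuming the wildcards appear literally in the pattern string; the helper name `fill_pattern` is hypothetical.

```python
def fill_pattern(pattern, values):
    """Substitute each wildcard in `pattern` with its corresponding value."""
    url = pattern
    for wildcard, value in values.items():
        url = url.replace(wildcard, str(value))
    return url


pattern = "http://bbs.XXX.com/forum-(d+)-(e+).html"
# (d+) keeps the forum number 99; (e+) is the page-turning block, set to first page 1.
first_page_url = fill_pattern(pattern, {"(d+)": 99, "(e+)": 1})
print(first_page_url)
```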
In a preferred example of the embodiment of the present invention, the first-page identifier may include 0, 1, and/or the maximum value among the current associated web pages.
In a specific implementation, the first page of a set of associated web pages generally carries important content, such as the text block shown in Fig. 3, so the first associated web page is comparatively important and identifying it is significant. Different websites adopt different page-turning structures, so the first associated web page differs between sites. For example, some websites use page 0 as the first associated web page, some use page 1, and some use the highest-numbered page (for example, 2100 as shown in Fig. 3).
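Because the numbering convention of a site is not known in advance, one possible approach is to generate a candidate first-page URL for each convention (0, 1, and the largest page number seen). The sketch below is an assumption about how this could be done; the function name, the single-wildcard pattern, and the list of observed page numbers are all hypothetical.

```python
def first_page_candidates(pattern, page_wildcard, observed_page_numbers):
    """Return candidate first-page URLs for the conventions 0, 1, and max."""
    candidates = [0, 1, max(observed_page_numbers)]
    # dict.fromkeys removes duplicates while preserving order.
    unique = list(dict.fromkeys(candidates))
    return [pattern.replace(page_wildcard, str(number)) for number in unique]


for url in first_page_candidates(
    "http://bbs.XXX.com/forum-99-(e+).html", "(e+)", [2, 3, 2100]
):
    print(url)
```

Deciding which candidate is the actual first page would need further evidence, which this sketch does not attempt.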
Of course, the first associated web page is only an example. When implementing the embodiment of the present invention, the digit block may be replaced with the identifier of any associated web page, according to actual conditions, to obtain the corresponding associated web page; the embodiments of the present invention do not describe each case in detail.
The present invention uses page-turning feature anchors to identify associated web pages, which gives high recognition accuracy. The associated web page URL pattern is computed from the URL of the specified web page and the associated URL, which is computationally efficient. Comparing the common parts of URLs significantly improves recall; in practical applications more than 90% of associated web pages can be identified.
The present invention replaces digit blocks with wildcard characters to obtain a first feature URL prefix and a second feature URL prefix. When the first feature URL prefix is identical to the second feature URL prefix, either prefix is taken as the associated web page URL pattern. Matching on the common part of URLs further improves the recognition accuracy for associated web pages.
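The prefix comparison can be sketched in a few lines. This is an illustration under stated assumptions: a digit block is approximated as any maximal run of digits, a single "*" serves as the wildcard character, and the function names are hypothetical.

```python
import re

DIGIT_BLOCK = re.compile(r"\d+")


def feature_prefix(url, wildcard="*"):
    """Replace every digit block in `url` with the wildcard character."""
    # A function replacement inserts the wildcard literally, without escape processing.
    return DIGIT_BLOCK.sub(lambda _match: wildcard, url)


def associated_url_pattern(page_url, associated_url):
    """Return the shared pattern when both feature prefixes agree, else None."""
    first = feature_prefix(page_url)
    second = feature_prefix(associated_url)
    return first if first == second else None


print(associated_url_pattern(
    "http://bbs.XXX.com/forum-99-3.html",
    "http://bbs.XXX.com/forum-99-4.html",
))
```

When the two prefixes differ, for example because the link leads to a different section of the site, the function returns `None` and the link is not treated as an associated web page.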
The present invention replaces the page-turning block of the associated web page URL pattern with the first-page identifier to obtain the URL of the first associated web page. In the same way, the page-turning block can be replaced with other link identifiers to obtain the URLs of other associated web pages, which increases coverage of associated web pages, makes it possible to obtain a more complete set of them, and enables fine-grained operations.
For simplicity of description, the method embodiments are expressed as a series of action combinations. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
Referring to Fig. 4, a structural block diagram of a device embodiment for identifying the page number identifier in a web page URL according to one embodiment of the present invention is shown. The device may specifically include the following modules:
Associated URL acquisition module 401, adapted to obtain the associated URL linked to by the page-turning feature anchor in the page elements of the specified web page;
Associated web page URL pattern calculation module 402, adapted to calculate the associated web page URL pattern according to the URL of the specified web page and the associated URL;
Page-number feature part determination module 403, adapted to determine, based on the associated web page URL pattern corresponding to the specified web page, the page-number feature part of the specified web page URL and the page-number feature part of the associated URL respectively;
Page number identifier determination module 404, adapted to compare the page-number feature parts of the specified web page URL and the associated page URL, and to extract the differing digit part as the page number identifier of the specified web page URL.
In a preferred embodiment of the present invention, the associated URL acquisition module 401 may be further adapted to:
Match the page-turning feature anchor against the DOM tree nodes of the specified web page;
When the match succeeds, obtain the associated URL from the matched page-turning feature anchor.
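The anchor matching performed by module 401 could look roughly like the sketch below, which uses Python's standard `html.parser` instead of a full DOM tree. The set of page-turning anchor texts is an assumption (this passage does not enumerate them), and the class and function names are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed page-turning anchor texts; not taken from the source.
PAGE_TURN_TEXTS = {"next", "next page", "previous", "previous page", "下一页", "上一页"}


class PageTurnLinkFinder(HTMLParser):
    """Collect the href of every <a> element whose text is a page-turning anchor."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            text = "".join(self._text).strip().lower()
            if text in PAGE_TURN_TEXTS:
                # Resolve relative links against the specified page's URL.
                self.links.append(urljoin(self.base_url, self._href))
            self._href = None


def find_associated_urls(html, base_url):
    finder = PageTurnLinkFinder(base_url)
    finder.feed(html)
    finder.close()
    return finder.links


sample = '<div><a href="forum-99-4.html">Next page</a><a href="/about.html">About</a></div>'
print(find_associated_urls(sample, "http://bbs.example.com/forum-99-3.html"))
```

One anchor text may match several links on a page, so the function returns a list of associated URLs.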
In a preferred embodiment of the present invention, the page-turning feature anchor may link to one or more associated URLs.
In a preferred embodiment of the present invention, the associated web page URL pattern calculation module 402 specifically includes the following submodules:
First feature URL prefix obtaining submodule, adapted to replace the digit blocks in the URL of the specified web page with wildcard characters to obtain the first feature URL prefix; wherein a digit block is a single digit or multiple digits delimited by separator marks;
Second feature URL prefix obtaining submodule, adapted to replace the digit blocks in the associated URL with wildcard characters to obtain the second feature URL prefix;
Associated web page URL pattern obtaining submodule, adapted to take the first feature URL prefix or the second feature URL prefix as the associated web page URL pattern when the first feature URL prefix is identical to the second feature URL prefix.
In a preferred embodiment of the present invention, the first feature URL prefix obtaining submodule may be further adapted to:
Replace the digit blocks at different positions in the URL of the specified web page with the same wildcard character to obtain the first feature URL prefix;
The second feature URL prefix obtaining submodule may be further adapted to:
Replace the digit blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.
In a preferred embodiment of the present invention, the first feature URL prefix obtaining submodule may be further adapted to:
Replace the digit blocks at different positions in the URL of the specified web page with different wildcard characters to obtain the first feature URL prefix;
The second feature URL prefix obtaining submodule may be further adapted to:
Replace each digit block in the associated URL with the same wildcard character used at the same position in the first feature URL prefix, to obtain the second feature URL prefix.
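The variant with a different wildcard per position can be sketched as follows. The wildcard spelling (d+), (e+), ... follows the example pattern used earlier in the description; treating a digit block as any run of digits and the function name `feature_prefix_distinct` are assumptions.

```python
import re
from itertools import count

DIGIT_BLOCK = re.compile(r"\d+")
# One letter per digit-block position, starting at "d" as in the (d+)-(e+) example.
WILDCARD_LETTERS = "defghijklmnopqrstuvwxyz"


def feature_prefix_distinct(url):
    """Replace the i-th digit block with the i-th wildcard: (d+), (e+), (f+), ..."""
    positions = count()

    def replace(_match):
        # Raises IndexError for URLs with more than 23 digit blocks.
        return f"({WILDCARD_LETTERS[next(positions)]}+)"

    return DIGIT_BLOCK.sub(replace, url)


first = feature_prefix_distinct("http://bbs.XXX.com/forum-99-3.html")
second = feature_prefix_distinct("http://bbs.XXX.com/forum-99-4.html")
print(first)
print(first == second)
```

Because wildcards are assigned by position, applying the same function to the associated URL automatically uses the same wildcard at the same position, and the resulting pattern lets each block (such as the page-turning block) be addressed individually.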
In a preferred embodiment of the present invention, the page number identifier may include a first-page identifier, and the first-page identifier may include 0, 1, and/or the maximum value among the current associated web pages.
Because the device embodiment of Fig. 4 is substantially similar to the method embodiment of Fig. 1, its description is relatively brief; for the relevant parts, refer to the corresponding description of the method embodiment.
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that the content of the present invention described herein can be implemented in a variety of programming languages, and that the descriptions of specific languages above are given to disclose the best mode of the present invention.
Numerous specific details are set forth in the specification provided herein. It will be understood, however, that embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the present invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. This method of disclosure, however, should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in less than all features of a single disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into one module, unit, or component, and may further be divided into multiple submodules, subunits, or subcomponents. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
In addition, those skilled in the art will understand that although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination of the two. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for identifying the page number identifier in a web page URL according to the embodiments of the present invention. The present invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the present invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such a signal may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention can be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A method for identifying a page number identifier in a web page URL, comprising:
obtaining the associated URL linked to by a page-turning feature anchor in the page elements of a specified web page;
calculating an associated web page URL pattern according to the URL of the specified web page and the associated URL;
determining, based on the associated web page URL pattern corresponding to the specified web page, the page-number feature part of the specified web page URL and the page-number feature part of the associated URL respectively;
comparing the page-number feature parts of the specified web page URL and the associated page URL, and extracting the differing digit part as the page number identifier of the specified web page URL.

2. The method according to claim 1, wherein the step of obtaining the associated URL linked to by the page-turning feature anchor in the page elements of the specified web page comprises:
matching the page-turning feature anchor against the DOM tree nodes of the specified web page;
when the match succeeds, obtaining the associated URL from the matched page-turning feature anchor.

3. The method according to claim 1, wherein the page-turning feature anchor links to one or more associated URLs.

4. The method according to claim 1, 2 or 3, wherein the step of calculating the associated web page URL pattern according to the URL of the specified web page and the associated URL comprises:
replacing the digit blocks in the URL of the specified web page with wildcard characters to obtain a first feature URL prefix, wherein a digit block is a single digit or multiple digits delimited by separator marks;
replacing the digit blocks in the associated URL with wildcard characters to obtain a second feature URL prefix;
when the first feature URL prefix is identical to the second feature URL prefix, taking the first feature URL prefix or the second feature URL prefix as the associated web page URL pattern.

5. The method according to claim 4, wherein the step of replacing the digit blocks in the URL of the specified web page with wildcard characters to obtain the first feature URL prefix is:
replacing the digit blocks at different positions in the URL of the specified web page with the same wildcard character to obtain the first feature URL prefix;
and the step of replacing the digit blocks in the associated URL with wildcard characters to obtain the second feature URL prefix is:
replacing the digit blocks at different positions in the associated URL with the same wildcard character to obtain the second feature URL prefix.

6. The method according to claim 4, wherein the step of replacing the digit blocks in the URL of the specified web page with wildcard characters to obtain the first feature URL prefix is:
replacing the digit blocks at different positions in the URL of the specified web page with different wildcard characters to obtain the first feature URL prefix;
and the step of replacing the digit blocks in the associated URL with wildcard characters to obtain the second feature URL prefix is:
replacing each digit block in the associated URL with the same wildcard character used at the same position in the first feature URL to obtain the second feature URL prefix.

7. The method according to claim 1, wherein the page number identifier includes a first-page identifier, and the first-page identifier includes 0, 1, and/or the maximum value among the current associated web pages.

8. A device for identifying a page number identifier in a web page URL, comprising:
an associated URL acquisition module, adapted to obtain the associated URL linked to by a page-turning feature anchor in the page elements of a specified web page;
an associated web page URL pattern calculation module, adapted to calculate an associated web page URL pattern according to the URL of the specified web page and the associated URL;
a page-number feature part determination module, adapted to determine, based on the associated web page URL pattern corresponding to the specified web page, the page-number feature part of the specified web page URL and the page-number feature part of the associated URL respectively;
a page number identifier determination module, adapted to compare the page-number feature parts of the specified web page URL and the associated page URL, and to extract the differing digit part as the page number identifier of the specified web page URL.

9. The device according to claim 8, wherein the associated URL acquisition module is further adapted to:
match the page-turning feature anchor against the DOM tree nodes of the specified web page;
when the match succeeds, obtain the associated URL from the matched page-turning feature anchor.

10. The device according to claim 8, wherein the page-turning feature anchor links to one or more associated URLs.
CN201310606990.9A 2013-11-25 2013-11-25 Method and device for recognizing page number identification in webpage URL Pending CN103631906A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201310606990.9A CN103631906A (en) 2013-11-25 2013-11-25 Method and device for recognizing page number identification in webpage URL
PCT/CN2014/086522 WO2015074455A1 (en) 2013-11-25 2014-09-15 Method and apparatus for computing url pattern of associated webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310606990.9A CN103631906A (en) 2013-11-25 2013-11-25 Method and device for recognizing page number identification in webpage URL

Publications (1)

Publication Number Publication Date
CN103631906A true CN103631906A (en) 2014-03-12

Family

ID=50212947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310606990.9A Pending CN103631906A (en) 2013-11-25 2013-11-25 Method and device for recognizing page number identification in webpage URL

Country Status (1)

Country Link
CN (1) CN103631906A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102053979A (en) * 2009-10-27 2011-05-11 华为技术有限公司 Information acquisition method and system
CN102567407A (en) * 2010-12-22 2012-07-11 北大方正集团有限公司 Method and system for collecting forum reply increment
CN102123168A (en) * 2011-01-14 2011-07-13 广州市动景计算机科技有限公司 Web page pre-reading and integration method and system based on relay server
CN102810110A (en) * 2012-05-07 2012-12-05 北京京东世纪贸易有限公司 Method and system for acquiring web text data
CN103049557A (en) * 2012-12-31 2013-04-17 百度在线网络技术(北京)有限公司 Website resource management method and website resource management device
CN103150358A (en) * 2013-02-27 2013-06-12 三星半导体(中国)研究开发有限公司 Device and method capable of performing continuous web browsing in mobile equipment
CN103258032A (en) * 2013-05-10 2013-08-21 清华大学 Parallel webpage obtaining method and parallel webpage obtaining device

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2015074455A1 (en) * 2013-11-25 2015-05-28 北京奇虎科技有限公司 Method and apparatus for computing url pattern of associated webpage
WO2015169193A1 (en) * 2014-05-04 2015-11-12 丘炎卫 Ptp interaction association system supporting connection between multimedia electronic product and internet
CN104965902A (en) * 2015-06-30 2015-10-07 北京奇虎科技有限公司 Enriched URL (uniform resource locator) recognition method and apparatus
CN105095386A (en) * 2015-06-30 2015-11-25 北京奇虎科技有限公司 Device and method for determining web page quality
CN108182398A (en) * 2017-12-26 2018-06-19 广东金赋科技股份有限公司 The method and device in the direction based on scanning device adjustment scan image
CN114943023A (en) * 2022-01-25 2022-08-26 北京金堤科技有限公司 Page turning logic acquisition method and device and website page turning control method and device
CN114943023B (en) * 2022-01-25 2025-06-27 北京金堤科技有限公司 Page turning logic acquisition, and website page turning control method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140312

RJ01 Rejection of invention patent application after publication