CN108446287A

CN108446287A - Web page crawl method and device

Info

Publication number: CN108446287A
Application number: CN201710085587.4A
Authority: CN
Inventors: 余清富
Original assignee: Beijing Gridsum Technology Co Ltd
Current assignee: Beijing Gridsum Technology Co Ltd
Priority date: 2017-02-16
Filing date: 2017-02-16
Publication date: 2018-08-24

Abstract

The invention discloses a web page crawling method and device. The method includes: intercepting, at a first website, an access request for accessing the first website, where the source of the access request is a second website; obtaining information of the second website from the uniform resource locator of the source, where the information of the second website includes the account, on the second website, of a user to be crawled; and crawling web pages from the second website according to the information of the second website. The invention solves the technical problem that web page data cannot be obtained accurately.

Description

Web page crawl method and device

Technical field

The present invention relates to the field of web page processing, and in particular to a web page crawling method and device.

Background

Microblogging occupies an important position among the social platforms of the Chinese Internet. It has a very large user base and generates a correspondingly large amount of data. To monitor the data on a microblog site, the data of the web pages on which the microblogs reside must be obtained. In the prior art, however, the party collecting website monitoring data faces a large volume of web page data on the microblog website, so obtaining the web page data is difficult and the data obtained is inaccurate.

No effective solution has yet been proposed for the above problem that web page data cannot be obtained accurately.

Summary of the invention

Embodiments of the present invention provide a web page crawling method and device, to at least solve the technical problem that web page data cannot be obtained accurately.

According to one aspect of the embodiments of the present invention, a web page crawling method is provided, including: intercepting, at a first website, an access request for accessing the first website, where the source of the access request is a second website; obtaining information of the second website from the uniform resource locator of the source, where the information of the second website includes account information, on the second website, of a user to be crawled; and crawling web pages from the second website according to the information of the second website.

Further, crawling web pages from the second website according to the information includes: logging in to the second website with a predetermined account through a crawl thread among multiple threads, where the multiple threads include a plurality of crawl threads and each crawl thread corresponds to one predetermined account; and, after logging in to the second website with the predetermined account, crawling web pages from the second website using the account of the second website.

Further, crawling web pages from the second website using the account of the second website includes: obtaining preconfigured restriction information; and controlling the crawl thread to crawl web pages from the second website at the access rate specified in the restriction information.

Further, logging in to the second website with a predetermined account through a crawl thread among multiple threads further includes: binding a fixed network address to each thread of the multiple threads.

Further, where logging in to the second website requires a verification code, logging in to the second website with the predetermined account includes at least one of: entering the verification code in a preset manner and logging in to the second website with the predetermined account; obtaining a verification code presented as a picture, recognizing the verification code in the picture, and logging in to the second website with the predetermined account using the recognized verification code.

Further, recognizing the verification code in the picture includes: recognizing the verification code in the picture according to a data model, where the data model is trained on multiple items of training data, and the training data includes verification code pictures of the second website obtained in advance and the verification code corresponding to each picture.

Further, recognizing the verification code in the picture includes: obtaining multiple items of feature information from the picture, where the feature information is used to distinguish the verification code from the background of the picture; and recognizing the verification code in the picture according to the multiple items of feature information.

According to another aspect of the embodiments of the present invention, a web page crawling device is provided, including: an interception unit, configured to intercept, at a first website, an access request for accessing the first website, where the source of the access request is a second website; an acquiring unit, configured to obtain information of the second website from the uniform resource locator of the source, where the information of the second website includes account information, on the second website, of a user to be crawled; and a crawl unit, configured to crawl web pages from the second website according to the information of the second website.

Further, the crawl unit includes: a login module, configured to log in to the second website with a predetermined account through a crawl thread among multiple threads, where the multiple threads include a plurality of crawl threads and each crawl thread corresponds to one predetermined account; and a crawl module, configured to crawl web pages from the second website using the account of the second website after the second website has been logged in to with the predetermined account.

Further, the crawl module includes: a first acquisition module, configured to obtain preconfigured restriction information; and a control module, configured to control the crawl thread to crawl web pages from the second website at the access rate specified in the restriction information.

Further, the login module further includes: a binding module, configured to bind a fixed network address to each thread of the multiple threads.

Further, where logging in to the second website requires a verification code, the login module includes at least one of: a predetermined verification module, configured to enter the verification code in a preset manner and log in to the second website with the predetermined account; an automatic verification module, configured to obtain a verification code presented as a picture, recognize the verification code in the picture, and log in to the second website with the predetermined account using the recognized verification code.

Further, the automatic verification module includes: a recognition module, configured to recognize the verification code in the picture according to a data model, where the data model is trained on multiple items of training data, and the training data includes verification code pictures of the second website obtained in advance and the verification code corresponding to each picture.

Further, the recognition module includes: a second acquisition module, configured to obtain multiple items of feature information from the picture, where the feature information is used to distinguish the verification code from the background of the picture; and a recognition submodule, configured to recognize the verification code in the picture according to the multiple items of feature information.

In the embodiments of the present invention, an access request for the first website whose source is the second website is intercepted at the first website; the information of the second website is then obtained from the uniform resource locator of the source of that access request; and, with the obtained information as the entry point for crawling, web pages are crawled from the second website. By analyzing the content of the web pages, the data of the website and its pages can be obtained accurately, which solves the technical problem that web page data cannot be obtained accurately.

Brief description of the drawings

The drawings described here are provided for a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of a web page crawling method according to an embodiment of the present invention;

Fig. 2 is a flowchart of an optional web page crawling method according to an embodiment of the present invention; and

Fig. 3 is a schematic diagram of a web page crawling device according to an embodiment of the present invention.

Detailed description

To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.

It should be noted that the terms "first", "second" and the like in the description, the claims, and the drawings are used to distinguish similar objects and do not describe a particular order or sequence. Data so used may be interchanged where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion: a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.

According to an embodiment of the present invention, a method embodiment of web page crawling is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.

Fig. 1 is a flowchart of a web page crawling method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:

Step S102: intercept, at the first website, an access request for accessing the first website, where the source of the access request is the second website;

Step S104: obtain the information of the second website from the uniform resource locator of the source, where the information of the second website includes account information, on the second website, of the user to be crawled;

Step S106: crawl web pages from the second website according to the information of the second website.

In the above embodiment, an access request for the first website whose source is the second website is intercepted at the first website; the information of the second website is then obtained from the uniform resource locator of the source of that access request; and, with the obtained information as the entry point for crawling, web pages are crawled from the second website. By analyzing the content of the web pages, the data of the website and its pages can be obtained accurately, which solves the technical problem that web page data cannot be obtained accurately.

Optionally, the uniform resource locator can be used to identify the account on the second website that published the access information for the first website. For example, an account on the second website publishes on the second website a link address through which the first website can be accessed, that is, the uniform resource locator of the first website, so that other accounts on the second website can issue access requests to the first website through that link. The access request for the first website can then be analyzed in reverse to obtain the account information of the second-website account that published the link, which makes it possible to crawl the page of that account and to analyze the account information from the crawl results.

Optionally, the access requests for the first website can be obtained at the server side of the first website: all requests for the first website are intercepted at its server side, and the access requests whose source is the second website are extracted from them.
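The interception and extraction steps can be sketched as follows. This is a minimal illustration, not the patented implementation: the host name `weibo.example`, the profile URL layout `/u/<account_id>`, and the request dictionaries with a `referrer` key are all assumptions introduced for the example, since the description does not specify how the source URL encodes the account.

```python
from urllib.parse import urlparse

# Hypothetical host of the second website; not taken from the patent.
SECOND_SITE_HOST = "weibo.example"


def account_from_referrer(referrer):
    """Return the account ID embedded in a source URL, or None.

    Assumes (hypothetically) that account pages on the second website
    look like https://weibo.example/u/<account_id>.
    """
    parsed = urlparse(referrer or "")
    if parsed.hostname != SECOND_SITE_HOST:
        return None  # the request did not come from the second website
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) >= 2 and parts[0] == "u":
        return parts[1]
    return None


def accounts_to_crawl(access_requests):
    """Collect distinct accounts from requests whose source is the second website."""
    accounts = []
    for request in access_requests:
        account = account_from_referrer(request.get("referrer"))
        if account is not None and account not in accounts:
            accounts.append(account)
    return accounts
```

Each returned account would then serve as an entry point for the crawl; requests with no source, or with a source on another host, are simply ignored.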

Optionally, a web crawler can be used to crawl the data of the second website and its web pages, where a web crawler is a program or script that automatically fetches web page information from the second website according to certain rules. The crawler finds pages through the link addresses they contain: starting from one page of the website, it reads the page content, finds the other link addresses in the page, follows those links to the next pages, and repeats this until all pages of the website have been fetched.
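The link-following loop described above can be sketched with the Python standard library. The `fetch` argument is an assumption of this sketch: a caller-supplied function that maps a URL to its HTML text (or `None` if the page is unavailable), so that login, throttling, and networking stay outside the traversal logic.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl_site(start_url, fetch):
    """Fetch every page reachable from start_url on the same host.

    Returns a dict mapping each fetched URL to its HTML text.
    """
    host = urlparse(start_url).netloc
    pages = {}
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            target = urljoin(url, href)  # resolve relative links
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append(target)
    return pages
```

The `seen` set is what guarantees termination: each URL is queued at most once, so the loop ends when no unvisited same-host link remains.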

As an optional embodiment, crawling web pages from the second website according to the information includes: logging in to the second website with a predetermined account through a crawl thread among multiple threads, where the multiple threads include a plurality of crawl threads and each crawl thread corresponds to one predetermined account; and, after logging in to the second website with the predetermined account, crawling web pages from the second website using the account of the second website.

Specifically, web pages can be crawled from the second website according to the obtained information in a multithreaded mode. Each crawl thread logs in to the second website with a predetermined account, and each thread corresponds to its own account. Each thread crawls web pages with its corresponding account, so the crawl operation is carried out with multiple second-website accounts and web page information on the second website can be obtained quickly.

Optionally, when multiple threads crawl web pages from the second website according to the information, the accounts corresponding to the threads can crawl at the same time, which speeds up crawling.
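One way to picture the one-account-per-thread arrangement is the sketch below, which uses `concurrent.futures.ThreadPoolExecutor`. The `login` and `fetch_page` callables, the account names, and the round-robin split of targets are assumptions made for illustration; the description only requires that each crawl thread correspond to one predetermined account.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_with_account(account, targets, login, fetch_page):
    """Log in once with a predetermined account, then crawl that thread's targets."""
    session = login(account)
    return {target: fetch_page(session, target) for target in targets}


def crawl_in_parallel(accounts, targets, login, fetch_page):
    """Run one crawl thread per predetermined account.

    Targets are dealt out round-robin, so each thread (and therefore
    each account) handles a disjoint share of the work.
    """
    shares = [targets[i::len(accounts)] for i in range(len(accounts))]
    results = {}
    with ThreadPoolExecutor(max_workers=len(accounts)) as pool:
        futures = [
            pool.submit(crawl_with_account, account, share, login, fetch_page)
            for account, share in zip(accounts, shares)
        ]
        for future in futures:
            results.update(future.result())
    return results
```

Because `max_workers` equals the number of accounts, every account gets its own worker thread and no account is shared between threads.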

In an optional embodiment, crawling web pages from the second website using the account of the second website includes: obtaining preconfigured restriction information; and controlling the crawl thread to crawl web pages from the second website at the access rate specified in the restriction information.

Specifically, when crawling web pages from the second website, the preconfigured restriction information that limits the crawl operation can be obtained first, and the rate at which the account of each crawl thread accesses the second website is controlled according to it. The crawl thread then crawls web pages at the access rate given in the restriction information, which avoids triggering the second website's limit on the number of accesses per account per unit time and allows the crawl operation to continue.
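A possible shape for the rate control is a small throttle object that each crawl thread calls before every request. The class name, its parameters, and the idea of drawing a random gap between a minimum and maximum interval are assumptions of this sketch; the clock, sleep, and random functions are injectable so the behaviour can be checked without real waiting.

```python
import random
import time


class Throttle:
    """Spaces out requests according to preconfigured restriction information."""

    def __init__(self, min_interval_s, max_interval_s,
                 clock=time.monotonic, sleep=time.sleep, rng=random.uniform):
        self.min_interval_s = min_interval_s
        self.max_interval_s = max_interval_s
        self._clock = clock
        self._sleep = sleep
        self._rng = rng
        self._next_allowed = None  # no request has been made yet

    def wait(self):
        """Block until the next request is allowed, then schedule the one after."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        gap = self._rng(self.min_interval_s, self.max_interval_s)
        self._next_allowed = now + gap
```

A crawl thread would hold its own instance, for example `Throttle(3, 6)` to match the description's example of one access every 3 to 6 seconds, and call `wait()` immediately before each page request.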

Optionally, logging in to the second website with a predetermined account through a crawl thread among multiple threads further includes: binding a fixed network address to each thread of the multiple threads.

Specifically, a fixed network address is bound to each thread of the multiple threads, so that each thread has its own network address and, likewise, the account that logs in to the second website in each thread has its own network address. This avoids the second website's detection of different accounts using the same network address and allows the crawl operation to continue.

In an optional embodiment, where logging in to the second website requires a verification code, logging in to the second website with the predetermined account includes at least one of: entering the verification code in a preset manner and logging in to the second website with the predetermined account; obtaining a verification code presented as a picture, recognizing the verification code in the picture, and logging in to the second website with the predetermined account using the recognized verification code.

Specifically, when the predetermined account logs in to the second website, a verification code may have to be entered before the login can be completed. In that case the verification code can be entered in a preset manner to complete the login of the predetermined account on the second website, where the preset manner may be manual entry of the verification code. Alternatively, the login can be completed by recognizing the verification code automatically: the verification code presented as a picture is obtained first, the verification code in the picture is recognized, and the login of the predetermined account on the second website is completed with the recognized verification code. With the login methods of this embodiment, the predetermined account can log in to the second website in preparation for the subsequent crawl operation.

As an optional embodiment, recognizing the verification code in the picture includes: recognizing the verification code in the picture according to a data model, where the data model is trained on multiple items of training data, and the training data includes verification code pictures of the second website obtained in advance and the verification code corresponding to each picture.

Specifically, the verification code in the obtained picture can be recognized with a data model obtained in advance. The data model can be trained on multiple items of training data, and the training data can be verification code pictures of the second website obtained in advance together with the verification code corresponding to each picture.
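The description does not say what kind of data model is used. As a stand-in, the sketch below uses a nearest-neighbour classifier over small binary character images: "training" stores labelled glyphs, and recognition picks the stored glyph with the fewest differing pixels. The glyph representation (tuples of `"0"`/`"1"` strings) and the assumption that the picture has already been split into single characters are both introduced for the example.

```python
def hamming(a, b):
    """Count differing pixels between two equal-sized binary glyphs."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


class GlyphModel:
    """Toy data model: memorises labelled glyphs, predicts by nearest neighbour."""

    def __init__(self):
        self.samples = []

    def train(self, training_data):
        """training_data is an iterable of (glyph, label) pairs."""
        self.samples.extend(training_data)

    def predict(self, glyph):
        """Return the label of the stored glyph closest to the given one."""
        best = min(self.samples, key=lambda sample: hamming(sample[0], glyph))
        return best[1]

    def recognise(self, glyphs):
        """Recognise a whole code from its already-segmented glyphs."""
        return "".join(self.predict(glyph) for glyph in glyphs)
```

A production system would more plausibly train a statistical or neural classifier on many labelled pictures; the point here is only the data flow from labelled pictures to a model to a recognized code.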

In an optional embodiment, recognizing the verification code in the picture includes: obtaining multiple items of feature information from the picture, where the feature information is used to distinguish the verification code from the background of the picture; and recognizing the verification code in the picture according to the multiple items of feature information.

Specifically, to recognize the verification code in the obtained picture, multiple items of feature information can be obtained from the picture and used to separate the verification code from the background of the picture, and the verification code is then recognized according to its feature information.
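The description leaves the feature information unspecified. One simple possibility, shown here purely as an assumption, is pixel brightness: dark pixels are treated as part of the code and light pixels as background, after which blank columns mark the gaps between characters. The picture is represented as a list of rows of grayscale values from 0 to 255, and the threshold of 128 is an arbitrary illustrative choice.

```python
def binarise(gray, threshold=128):
    """Mark pixels darker than threshold as foreground (1), others as background (0)."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def split_columns(binary):
    """Return (start, end) column ranges that contain foreground pixels.

    Runs of blank columns are treated as gaps between characters;
    end is exclusive.
    """
    width = len(binary[0])
    has_ink = [any(row[x] for row in binary) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x  # a character begins
        elif not ink and start is not None:
            spans.append((start, x))  # a character ends at a blank column
            start = None
    if start is not None:
        spans.append((start, width))
    return spans
```

Each column span can then be cropped out and passed to whatever character recognizer is in use. Real verification code pictures usually add noise, lines, or touching characters, which this sketch does not handle.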

Fig. 2 is a flowchart of an optional web page crawling method according to an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps:

Step S202: log in to an account of the second website;

Step S204: access the web page of the account to be crawled;

Step S206: crawl the accessed web page.

Through the above steps, an account of the second website is logged in first, that account is then used to access the web page of the account to be crawled, and a crawling tool such as a web crawler starts the crawl operation on the accessed page.

It should be noted that the account to be crawled is determined before the crawl operation is carried out: the server side of the second website filters out the access requests directed at the first website, the account of the second website corresponding to each access request is obtained from the uniform resource locator of the request's source, and the obtained account is used as the entry point for crawling the web pages of the second website.

Optionally, access requests for the first website may come from multiple accounts on the second website. To obtain these accounts, filtering is performed at the server side of the second website, and the accounts filtered out are used as the accounts to be crawled.

While web pages are being crawled from the second website, the crawl is subject to restrictions imposed by the second website. For example: an account must be registered on the second website and logged in before the second website can be accessed; the second website performs remote-login detection on accounts, and a verification code must be entered as the second website requires to complete the login; during login the second website may also check the network address from which an account logs in and restrict access by different accounts that use the same network address; and the second website may limit the number of accesses by each account within a given time.

To crawl the web pages of the second website stably and continuously, the crawl operation must be completed within the restrictions of the second website.

In an optional embodiment, the crawl operation may use multiple threads. Each thread uses one account and is bound to one fixed proxy network address, with accounts and proxy network addresses in one-to-one correspondence. This avoids the second website's detection of different accounts using the same network address and allows the crawl operation to continue.
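The one-to-one correspondence between accounts and proxy addresses can be enforced when the threads are configured. The function names, the configuration dictionaries, and the example addresses are hypothetical; the sketch only shows the pairing and the checks that keep it one to one.

```python
def bind_proxies(accounts, proxies):
    """Pair each account with its own fixed proxy address, one to one."""
    if len(set(accounts)) != len(accounts) or len(set(proxies)) != len(proxies):
        raise ValueError("accounts and proxies must each be unique")
    if len(proxies) < len(accounts):
        raise ValueError("need at least one proxy per account")
    return dict(zip(accounts, proxies))


def thread_configs(accounts, proxies):
    """Build one configuration per crawl thread: its account and its fixed proxy."""
    binding = bind_proxies(accounts, proxies)
    return [
        {"thread": index, "account": account, "proxy": binding[account]}
        for index, account in enumerate(accounts)
    ]
```

Rejecting duplicate or insufficient proxies up front means no two accounts can end up behind the same network address, which is the condition the binding is meant to guarantee.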

As an optional embodiment, the rate at which each account accesses the second website is configured in advance. This avoids triggering the second website's limit on the number of accesses per account per unit time and allows the crawl operation to continue.

Optionally, the interval at which an account accesses the second website can be set in advance. For example, an account can be set to access the second website once every 3 to 6 seconds.

Optionally, the web pages crawled from the second website may be its mobile web pages. A website's mobile pages are usually simplified versions that leave the page content unchanged, and accessing them requires less traffic. Crawling the mobile pages therefore consumes less network traffic without affecting crawl accuracy, which speeds up crawling.

In an optional embodiment, logging in to an account on the second website may require a verification code. The verification code can be entered manually on the web page of the web crawler that performs the crawl operation to complete the login, or it can be entered automatically by an image recognition program to complete the login.

Fig. 3 is a schematic diagram of a web page crawling device according to an embodiment of the present invention. As shown in Fig. 3, the device includes: an interception unit 31, configured to intercept, at the first website, an access request for accessing the first website, where the source of the access request is the second website; an acquiring unit 33, configured to obtain the information of the second website from the uniform resource locator of the source, where the information of the second website includes account information, on the second website, of the user to be crawled; and a crawl unit 35, configured to crawl web pages from the second website according to the information of the second website.

In the above embodiment, the interception unit intercepts, at the first website, an access request for the first website whose source is the second website. The acquiring unit then obtains, from the uniform resource locator of the source of that access request, the information of the second website, which includes the account to be crawled on the second website. With the obtained information as the entry point for crawling, the crawl unit crawls web pages from the second website. By analyzing the content of the web pages, the data of the website and its pages can be obtained accurately, which solves the technical problem that web page data cannot be obtained accurately.

In an optional embodiment, the crawl unit includes: a login module, configured to log in to the second website with a predetermined account through a crawl thread among multiple threads, where the multiple threads include a plurality of crawl threads and each crawl thread corresponds to one predetermined account; and a crawl module, configured to crawl web pages from the second website using the account of the second website after the second website has been logged in to with the predetermined account.

As an optional embodiment, the crawl module includes: a first acquisition module, configured to obtain preconfigured restriction information; and a control module, configured to control the crawl thread to crawl web pages from the second website at the access rate specified in the restriction information.

Optionally, the login module further includes: a binding module, configured to bind a fixed network address to each thread of the multiple threads.

As an optional embodiment, where logging in to the second website requires a verification code, the login module includes at least one of: a predetermined verification module, configured to enter the verification code in a preset manner and log in to the second website with the predetermined account; an automatic verification module, configured to obtain a verification code presented as a picture, recognize the verification code in the picture, and log in to the second website with the predetermined account using the recognized verification code.

In an optional embodiment, the automatic verification module includes: a recognition module, configured to recognize the verification code in the picture according to a data model, where the data model is trained on multiple items of training data, and the training data includes verification code pictures of the second website obtained in advance and the verification code corresponding to each picture.

In an optional embodiment, the recognition module includes: a second acquisition module, configured to obtain multiple items of feature information from the picture, where the feature information is used to distinguish the verification code from the background of the picture; and a recognition submodule, configured to recognize the verification code in the picture according to the multiple items of feature information.

The above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through interfaces, units, or modules, and may be electrical or take other forms.

The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. On this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The storage medium includes any medium that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.

The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. a kind of web page crawl method, which is characterized in that including：

From the first website, interception accesses the access request of first website, wherein the source of the access request is the second net It stands；

The information of second website is obtained from the uniform resource locator in the source, wherein the letter of second website Breath includes account information of the user to be crawled in second website；

According to the information of second website webpage is crawled from second website.

2. The method according to claim 1, characterized in that crawling web pages from the second website according to the information comprises:

logging in to the second website using a predetermined account through a crawling thread among multiple threads, wherein the multiple threads include a plurality of the crawling threads, and each crawling thread corresponds to one predetermined account;

after logging in to the second website using the predetermined account, crawling web pages from the second website using the account on the second website.

3. The method according to claim 2, characterized in that crawling web pages from the second website using the account on the second website comprises:

obtaining preconfigured restriction information;

controlling the crawling thread to crawl web pages from the second website at the access speed specified in the restriction information.
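
The arrangement of claims 2 and 3 (one crawling thread per predetermined account, each throttled by a configured access speed) can be sketched as follows. This is only an illustration under stated assumptions: the names `crawl_with_limit` and `run_crawl_threads` are hypothetical, the restriction information is reduced to a single `min_interval` in seconds between requests, and the login and page download are abstracted into a caller-supplied `fetch(account, url)` function rather than a real HTTP client.

```python
import threading
import time
from typing import Callable, Dict, List


def crawl_with_limit(account: str, urls: List[str], min_interval: float,
                     fetch: Callable[[str, str], str],
                     results: Dict[str, List[str]],
                     lock: threading.Lock) -> None:
    """Crawl urls for one account, waiting at least min_interval between requests."""
    pages = []
    last_request = None
    for url in urls:
        if last_request is not None:
            # Sleep for whatever remains of the configured interval.
            wait = min_interval - (time.monotonic() - last_request)
            if wait > 0:
                time.sleep(wait)
        last_request = time.monotonic()
        pages.append(fetch(account, url))
    with lock:
        results[account] = pages


def run_crawl_threads(jobs: Dict[str, List[str]], min_interval: float,
                      fetch: Callable[[str, str], str]) -> Dict[str, List[str]]:
    """Start one crawling thread per predetermined account and collect the pages."""
    results: Dict[str, List[str]] = {}
    lock = threading.Lock()
    threads = [
        threading.Thread(target=crawl_with_limit,
                         args=(account, urls, min_interval, fetch, results, lock))
        for account, urls in jobs.items()
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Keeping the interval per thread means each account is throttled independently, which matches the one-account-per-thread correspondence in claim 2.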

4. The method according to claim 2, characterized in that logging in to the second website using a predetermined account through a crawling thread among multiple threads further comprises:

binding a fixed network address to each thread among the multiple threads.

5. The method according to claim 2, characterized in that, in the case where logging in to the second website requires a verification code, logging in to the second website using the predetermined account comprises at least one of the following:

inputting the verification code in a preset manner and logging in to the second website using the predetermined account;

obtaining the verification code presented in the form of a picture, recognizing the verification code in the picture, and logging in to the second website using the predetermined account according to the recognized verification code.

6. The method according to claim 5, characterized in that recognizing the verification code in the picture comprises:

recognizing the verification code in the picture according to a data model, wherein the data model is obtained by training on multiple items of training data, and the training data includes: verification code pictures of the second website obtained in advance and the verification codes corresponding to the verification code pictures.

7. The method according to claim 5, characterized in that recognizing the verification code in the picture comprises:

obtaining multiple items of feature information in the picture, wherein the feature information is used to distinguish the verification code from the background of the picture;

recognizing the verification code in the picture according to the multiple items of feature information.
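
The first step of claim 7, using feature information to separate the verification code from the picture background, might look like the sketch below. It assumes, purely for illustration, that the picture is a grid of grayscale values (0 to 255), that pixel brightness is the distinguishing feature, and that the characters are darker than the background; the source does not say which features are used. The function names are hypothetical, and the recognition of each separated character is not shown.

```python
from typing import List, Tuple


def separate_foreground(pixels: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Mark pixels darker than threshold as foreground (1), the rest as background (0)."""
    return [[1 if value < threshold else 0 for value in row] for row in pixels]


def column_segments(mask: List[List[int]]) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) column ranges that contain foreground pixels."""
    if not mask:
        return []
    width = len(mask[0])
    segments = []
    start = None
    for x in range(width):
        has_ink = any(row[x] for row in mask)
        if has_ink and start is None:
            start = x  # a new character candidate begins
        elif not has_ink and start is not None:
            segments.append((start, x - 1))  # the candidate ended at the previous column
            start = None
    if start is not None:
        segments.append((start, width - 1))
    return segments
```

Each column range is a candidate character region that a later recognition step could classify.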

8. A web page crawling device, characterized by comprising:

an interception unit, configured to intercept, at a first website, an access request for accessing the first website, wherein the source of the access request is a second website;

an acquiring unit, configured to obtain information of the second website from the uniform resource locator of the source, wherein the information of the second website includes account information, on the second website, of a user to be crawled;

a crawling unit, configured to crawl web pages from the second website according to the information of the second website.

9. The device according to claim 8, characterized in that the crawling unit comprises:

a login module, configured to log in to the second website using a predetermined account through a crawling thread among multiple threads, wherein the multiple threads include a plurality of the crawling threads, and each crawling thread corresponds to one predetermined account;

a crawling module, configured to, after the second website is logged in to using the predetermined account, crawl web pages from the second website using the account on the second website.

10. The device according to claim 9, characterized in that the crawling module comprises:

a first acquisition module, configured to obtain preconfigured restriction information;

a control module, configured to control the crawling thread to crawl web pages from the second website at the access speed specified in the restriction information.