CN108446287A - Web page crawl method and device - Google Patents
Web page crawling method and device
- Publication number
- CN108446287A (application CN201710085587.4A)
- Authority
- CN
- China
- Prior art keywords
- website
- verification code
- account
- information
- webpage
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a web page crawling method and device. The method includes: intercepting, at a first website, an access request for accessing the first website, where the source of the access request is a second website; obtaining information about the second website from the uniform resource locator (URL) of the source, where the information about the second website includes the account, on the second website, of a user to be crawled; and crawling web pages from the second website according to the information about the second website. The invention solves the technical problem that web page data cannot be obtained accurately.
Description
Technical field
The present invention relates to the field of web page processing, and in particular to a web page crawling method and device.
Background art
Microblogs occupy an important position among the social platforms of the Chinese internet. They have a huge user base and generate a correspondingly large amount of data. To monitor this data, the data of the web pages on the microblog site must be obtained. In the prior art, however, the party collecting monitoring data for a website faces a large volume of web page data on the microblog site, so obtaining the web page data is difficult and the data obtained is inaccurate.
No effective solution has yet been proposed for the above problem that web page data cannot be obtained accurately.
Summary of the invention
The embodiments of the present invention provide a web page crawling method and device, to at least solve the technical problem that web page data cannot be obtained accurately.
According to one aspect of the embodiments of the present invention, a web page crawling method is provided, including: intercepting, at a first website, an access request for accessing the first website, where the source of the access request is a second website; obtaining information about the second website from the uniform resource locator (URL) of the source, where the information about the second website includes account information, on the second website, of a user to be crawled; and crawling web pages from the second website according to the information about the second website.
Further, crawling webpage from second website according to described information includes:Pass through crawling in multithreading
Thread logs in second website using predetermined account, wherein the multithreading include it is multiple it is described crawl thread, it is each described
It crawls thread and corresponds to a predetermined account;After logging in second website using predetermined account, described second is used
The account of website crawls webpage from second website.
Further, crawling webpage from second website using the account of second website includes:It obtains advance
The restricted information of configuration;Thread is crawled described in control to be swashed from second website according to the access speed in the restricted information
Take webpage.
Further, further include using predetermined account login second website by the thread that crawls in multithreading:For
Per thread in the multithreading binds a fixed network address.
Further, in the case where login second website needs identifying code, the predetermined account is used to log in institute
It includes at least one of to state the second website:Identifying code, which is inputted, according to predetermined manner logs in described second using the predetermined account
Website;The identifying code that occurs with graphic form is obtained, the identifying code in the picture is identified, and according to identifying
The identifying code log in second website using the predetermined account.
Further, to the identifying code in the picture be identified including:According to data model in the picture
Identifying code is identified, wherein the data model trains to obtain according to multiple training datas, the training data packet
It includes:The identifying code picture identifying code corresponding with the identifying code picture of second website got in advance.
Further, to the identifying code in the picture be identified including:Obtain multiple features letter in the picture
Breath, wherein the characteristic information is used to distinguish the background of the identifying code and the picture;According to the multiple characteristic information pair
Identifying code in the picture is identified.
According to another aspect of the embodiments of the present invention, a web page crawling device is provided, including: an interception unit, configured to intercept, at a first website, an access request for accessing the first website, where the source of the access request is a second website; an acquiring unit, configured to obtain information about the second website from the URL of the source, where the information about the second website includes account information, on the second website, of a user to be crawled; and a crawling unit, configured to crawl web pages from the second website according to the information about the second website.
Further, the crawling unit includes: a login module, configured to log in to the second website with a predetermined account through a crawling thread among multiple threads, where the multiple threads include a plurality of crawling threads and each crawling thread corresponds to one predetermined account; and a crawling module, configured to crawl web pages from the second website using the account of the second website after logging in to the second website with the predetermined account.
Further, the crawling module includes: a first acquisition module, configured to obtain preconfigured restriction information; and a control module, configured to control the crawling thread to crawl web pages from the second website at the access speed given in the restriction information.
Further, the login module further includes: a binding module, configured to bind one fixed network address to each thread of the multiple threads.
Further, where logging in to the second website requires a verification code, the login module includes at least one of the following: a preset verification module, configured to enter the verification code in a preset manner and log in to the second website with the predetermined account; or an automatic verification module, configured to obtain a verification code presented as a picture, recognize the verification code in the picture, and log in to the second website with the predetermined account according to the recognized verification code.
Further, the automatic verification module includes: a recognition module, configured to recognize the verification code in the picture according to a data model, where the data model is trained on multiple items of training data, and the training data includes verification code pictures of the second website obtained in advance and the verification codes corresponding to those pictures.
Further, the recognition module includes: a second acquisition module, configured to obtain multiple pieces of feature information from the picture, where the feature information is used to distinguish the verification code from the background of the picture; and a recognition submodule, configured to recognize the verification code in the picture according to the multiple pieces of feature information.
In the embodiments of the present invention, access requests for the first website whose source is the second website are intercepted at the first website; the information about the second website is then obtained from the URL of the source of those access requests; and the obtained information about the second website is used as the entry point for crawling web pages from the second website. Analyzing the content of the crawled web pages makes it possible to obtain the data of the website and its web pages accurately, which solves the technical problem that web page data cannot be obtained accurately.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a web page crawling method according to an embodiment of the present invention;
Fig. 2 is a flowchart of an optional web page crawling method according to an embodiment of the present invention; and
Fig. 3 is a schematic diagram of a web page crawling device according to an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are obviously only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings are used to distinguish similar objects and are not used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "comprise" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
According to an embodiment of the present invention, a method embodiment of web page crawling is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
Fig. 1 is a flowchart of a web page crawling method according to an embodiment of the present invention. As shown in Fig. 1, the method includes the following steps:
Step S102: intercept, at a first website, an access request for accessing the first website, where the source of the access request is a second website;
Step S104: obtain information about the second website from the URL of the source, where the information about the second website includes account information, on the second website, of a user to be crawled;
Step S106: crawl web pages from the second website according to the information about the second website.
In the above embodiment, access requests for the first website whose source is the second website are intercepted at the first website; the information about the second website is then obtained from the URL of the source of those access requests; and the obtained information about the second website is used as the entry point for crawling web pages from the second website. Analyzing the content of the crawled web pages makes it possible to obtain the data of the website and its web pages accurately, which solves the technical problem that web page data cannot be obtained accurately.
Optionally, the URL makes it possible to obtain the account information of the account on the second website that published the means of accessing the first website. For example, an account on the second website publishes on the second website a link address that leads to the first website, that is, the URL of the first website, so that other accounts on the second website can issue access requests for the first website through that link address. The access request for the first website can then be analyzed in reverse to obtain the account information of the account on the second website that published the link address of the first website. This facilitates the subsequent crawling of the page where that account is located, and the account information on the second website can be analyzed from the crawl results.
Optionally, the access requests for the first website can be obtained on the server side of the first website: all requests for accessing the first website are intercepted on the server side of the first website, and the access requests whose source is the second website are extracted from all the requests obtained.
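The filtering and reverse analysis described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the host name `weibo.example`, the `/u/<account id>` URL layout, and the `referrer` field of each request record are all assumptions made for the example.

```python
from typing import Optional
from urllib.parse import urlparse

# Hypothetical host of the "second website"; not taken from the patent.
SECOND_SITE_HOST = "weibo.example"


def account_from_referrer(referrer: str) -> Optional[str]:
    """Return the account id found in a referrer URL, or None.

    Assumes (for illustration only) that account pages look like
    https://weibo.example/u/<account id>.
    """
    parts = urlparse(referrer)
    if parts.hostname != SECOND_SITE_HOST:
        return None  # the request did not come from the second website
    segments = [s for s in parts.path.split("/") if s]
    if len(segments) >= 2 and segments[0] == "u":
        return segments[1]
    return None


def accounts_to_crawl(requests: list) -> list:
    """Collect the distinct accounts, in first-seen order, from requests
    whose source is the second website."""
    found = []
    for request in requests:
        account = account_from_referrer(request.get("referrer", ""))
        if account is not None and account not in found:
            found.append(account)
    return found
```

Each record here is a plain dictionary standing in for whatever the first website's server actually logs; a real deployment would read the source URL from its own request objects.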
Optionally, a web crawler can be used to crawl the data of the second website and its web pages, where a web crawler is a program or script that automatically captures web page information on the second website according to certain rules. The crawler finds web pages through the link addresses in web pages: starting from some page of the website, it reads the page content, finds the other link addresses in that page, uses those link addresses to find the next pages, and repeats this cycle until all web pages of the website have been captured.
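The link-following loop described above can be sketched as a breadth-first traversal. The sketch assumes a caller-supplied function `fetch_links` (a hypothetical name) that downloads one page and returns the link addresses found in it; the patent does not specify the traversal order, so breadth-first is a choice made here for illustration.

```python
from collections import deque
from typing import Callable, Iterable


def crawl_site(start_url: str,
               fetch_links: Callable[[str], Iterable[str]]) -> list:
    """Visit every page reachable from start_url exactly once and
    return the pages in the order they were discovered."""
    visited = [start_url]
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        url = queue.popleft()
        for link in fetch_links(url):
            if link not in seen:  # skip pages that are already scheduled
                seen.add(link)
                visited.append(link)
                queue.append(link)
    return visited
```

The `seen` set is what guarantees termination on sites whose pages link back to each other.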
As an optional embodiment, crawling web pages from the second website according to the information includes: logging in to the second website with a predetermined account through a crawling thread among multiple threads, where the multiple threads include a plurality of crawling threads and each crawling thread corresponds to one predetermined account; and, after logging in to the second website with the predetermined account, crawling web pages from the second website using the account of the second website.
Specifically, web pages can be crawled from the second website, according to the obtained information about the second website, in a multithreaded crawling mode. A crawling thread among the multiple threads logs in to the second website with a predetermined account, and each thread corresponds to its own account. The crawling operation uses the account of each thread to crawl web pages, so that the crawling threads crawl web pages from the second website according to the information. Performing the crawling operation with the multiple second-website accounts that correspond to the multiple threads makes it possible to obtain the web page information on the second website quickly.
Optionally, when multiple threads are used to crawl web pages from the second website according to the information, the accounts corresponding to the individual threads can crawl web pages simultaneously, which speeds up crawling.
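The one-account-per-thread arrangement can be sketched with a thread pool. This is an illustrative structure only: `crawl(account, target)` is a hypothetical callable standing in for logging in and fetching a page, the round-robin split of targets is an assumption, and at least one account is assumed to be supplied.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def crawl_with_accounts(accounts: list, targets: list,
                        crawl: Callable[[str, str], str]) -> dict:
    """Run one crawling thread per predetermined account.

    Targets are dealt out round-robin, so thread i always works with
    accounts[i] and no account is shared between threads.
    """
    def worker(index: int) -> dict:
        account = accounts[index]
        share = targets[index::len(accounts)]
        return {target: crawl(account, target) for target in share}

    results = {}
    with ThreadPoolExecutor(max_workers=len(accounts)) as pool:
        for partial in pool.map(worker, range(len(accounts))):
            results.update(partial)
    return results
```

Keeping the account fixed per worker, rather than drawing accounts from a shared pool, mirrors the one-to-one correspondence between threads and accounts described in the text.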
In an optional embodiment, crawling web pages from the second website using the account of the second website includes: obtaining preconfigured restriction information; and controlling the crawling thread to crawl web pages from the second website at the access speed given in the restriction information.
Specifically, when crawling web pages from the second website, the preconfigured restriction information that limits the crawling operation can be obtained first, and the speed at which the account of each crawling thread accesses the second website is controlled according to it. The crawling thread then crawls web pages from the second website at the access speed given in the restriction information, which avoids the second website's limit on the number of accesses per account per unit time and allows the crawling operation to continue.
Optionally, logging in to the second website with a predetermined account through a crawling thread among multiple threads further includes: binding one fixed network address to each thread of the multiple threads.
Specifically, one fixed network address is bound to each thread of the multiple threads, so that each thread has its own network address and the account that logs in to the second website in each thread likewise has its own network address. This avoids the second website's detection of different accounts using the same network address and allows the crawling operation to continue.
In an optional embodiment, where logging in to the second website requires a verification code, logging in to the second website with the predetermined account includes at least one of the following: entering the verification code in a preset manner and logging in to the second website with the predetermined account; or obtaining a verification code presented as a picture, recognizing the verification code in the picture, and logging in to the second website with the predetermined account according to the recognized verification code.
Specifically, while a predetermined account is logging in to the second website, the account may need to enter a verification code before login can be completed. In that case, the verification code can be entered in a preset manner to complete the login of the predetermined account on the second website, where the preset manner may be manual entry of the verification code. Alternatively, the login of the predetermined account on the second website can be completed by recognizing the verification code automatically: the verification code presented as a picture is obtained first, the verification code in the picture is recognized, and the login is completed with the recognized verification code. With the login methods of the above embodiment, the predetermined account can log in to the second website, which prepares for the subsequent crawling operation.
As an optional embodiment, recognizing the verification code in the picture includes: recognizing the verification code in the picture according to a data model, where the data model is trained on multiple items of training data, and the training data includes verification code pictures of the second website obtained in advance and the verification codes corresponding to those pictures.
Specifically, the verification code in the obtained picture can be recognized with a previously obtained data model. The data model can be trained on multiple items of training data, and the training data can be verification code pictures of the second website obtained in advance together with the verification code corresponding to each picture.
In an optional embodiment, recognizing the verification code in the picture includes: obtaining multiple pieces of feature information from the picture, where the feature information is used to distinguish the verification code from the background of the picture; and recognizing the verification code in the picture according to the multiple pieces of feature information.
Specifically, to recognize the verification code in the obtained picture, multiple pieces of feature information can be obtained from the picture and used to separate the verification code from the background of the picture, and the verification code in the picture is then recognized according to the feature information of the verification code.
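The patent does not say which features separate the characters from the background. As one generic possibility, the sketch below uses pixel brightness: it marks every pixel darker than the image mean as foreground. The function name and the choice of mean thresholding are assumptions for illustration, and the input is a grayscale image given as a non-empty list of rows of integer intensities.

```python
def binarize(gray: list) -> list:
    """Mark pixels darker than the image mean as foreground (1) and
    all other pixels as background (0)."""
    pixels = [value for row in gray for value in row]
    threshold = sum(pixels) / len(pixels)
    return [[1 if value < threshold else 0 for value in row] for row in gray]
```

Mean thresholding only works when the characters are clearly darker than the background; other features such as color or stroke width would need a different separation step.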
Fig. 2 is a flowchart of an optional web page crawling method according to an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps:
Step S202: log in to an account of the second website;
Step S204: access the web page where the account to be crawled is located;
Step S206: crawl the accessed web page.
Through the above steps, an account of the second website is logged in first, that account is then used to access the web page where the account to be crawled is located, and a crawling tool such as a web crawler starts the crawling operation on the accessed web page.
It should be noted that the account to be crawled is determined before the crawling operation is performed: access requests directed to the first website are filtered out on the server side of the second website, the account on the second website corresponding to each such access request is obtained from the URL of the source of the request, and the obtained account on the second website is used as the entry point for crawling the web pages of the second website.
Optionally, on the second website, access requests for the first website may come from multiple accounts on the second website. To obtain these accounts, filtering is performed on the server side of the second website, and the accounts filtered out are used as the accounts to be crawled.
While web pages are being crawled from the second website, the crawler may be subject to restrictions imposed by the second website. For example: an account must be registered and logged in on the second website before the second website can be accessed; the second website performs remote-login detection on accounts, so that a verification code must be entered as the second website requires in order to complete login; during login the second website may also check the network address from which an account logs in and restrict access by different accounts that use the same network address; and the second website may limit the number of accesses by each account within a predetermined time.
To crawl the web pages of the second website stably and persistently, the crawling operation needs to be completed within the restrictions of the second website.
In an optional embodiment, the crawling operation may be multithreaded. Each thread uses one account and is bound to one fixed proxy network address, with accounts and proxy network addresses in one-to-one correspondence. This avoids the second website's detection of different accounts using the same network address and allows the crawling operation to continue.
As an optional embodiment, the speed at which each account accesses the second website is preconfigured, which avoids the second website's limit on the number of accesses per account per unit time and allows the crawling operation to continue.
Optionally, the interval at which an account accesses the second website can be preset. For example, an account can be set to access the second website once every 3 to 6 seconds.
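A preset access interval of this kind can be sketched as a generator that pauses between consecutive requests. The 3 and 6 second bounds come from the example in the text; drawing the pause uniformly at random from that range, and the function and parameter names, are assumptions made here.

```python
import random
import time
from typing import Callable, Iterable, Iterator


def throttled(urls: Iterable[str],
              min_wait: float = 3.0,
              max_wait: float = 6.0,
              sleep: Callable[[float], None] = time.sleep) -> Iterator[str]:
    """Yield urls one at a time, pausing between min_wait and max_wait
    seconds before every item except the first."""
    first = True
    for url in urls:
        if not first:
            sleep(random.uniform(min_wait, max_wait))
        first = False
        yield url
```

A crawling thread would iterate over `throttled(page_urls)` and fetch each URL as it is yielded. The `sleep` parameter exists so the pause can be replaced in tests without waiting.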
Optionally, the web pages crawled from the second website may be the mobile web pages of the second website. The mobile version of a website is usually a simplified page with the same content, so accessing it requires less traffic. Crawling the mobile web pages therefore consumes less network traffic without affecting the accuracy of the crawl, which speeds up crawling.
In an optional embodiment, logging in to an account on the second website may require a verification code. The verification code can be entered manually on the web page on which the web crawler performs the crawling operation to complete login, or it can be entered automatically by a picture recognition program to complete login.
Fig. 3 is a schematic diagram of a web page crawling device according to an embodiment of the present invention. As shown in Fig. 3, the device includes: an interception unit 31, configured to intercept, at a first website, an access request for accessing the first website, where the source of the access request is a second website; an acquiring unit 33, configured to obtain information about the second website from the URL of the source, where the information about the second website includes account information, on the second website, of a user to be crawled; and a crawling unit 35, configured to crawl web pages from the second website according to the information about the second website.
In the above embodiment, the interception unit intercepts, at the first website, access requests for the first website whose source is the second website; the acquiring unit then obtains, from the URL of the source of those access requests, the information about the second website, which includes the account on the second website to be crawled; and the crawling unit uses the obtained information about the second website as the entry point for crawling web pages from the second website. Analyzing the content of the crawled web pages makes it possible to obtain the data of the website and its web pages accurately, which solves the technical problem that web page data cannot be obtained accurately.
In an optional embodiment, the crawling unit includes: a login module, configured to log in to the second website with a predetermined account through a crawling thread among multiple threads, where the multiple threads include a plurality of crawling threads and each crawling thread corresponds to one predetermined account; and a crawling module, configured to crawl web pages from the second website using the account of the second website after logging in to the second website with the predetermined account.
As an optional embodiment, the crawling module includes: a first acquisition module, configured to obtain preconfigured restriction information; and a control module, configured to control the crawling thread to crawl web pages from the second website at the access speed given in the restriction information.
Optionally, the login module further includes: a binding module, configured to bind one fixed network address to each thread of the multiple threads.
As an optional embodiment, where logging in to the second website requires a verification code, the login module includes at least one of the following: a preset verification module, configured to enter the verification code in a preset manner and log in to the second website with the predetermined account; or an automatic verification module, configured to obtain a verification code presented as a picture, recognize the verification code in the picture, and log in to the second website with the predetermined account according to the recognized verification code.
In an optional embodiment, the automatic verification module includes: a recognition module, configured to recognize the verification code in the picture according to a data model, where the data model is trained on multiple items of training data, and the training data includes verification code pictures of the second website obtained in advance and the verification codes corresponding to those pictures.
In an optional embodiment, the recognition module includes: a second acquisition module, configured to obtain multiple pieces of feature information from the picture, where the feature information is used to distinguish the verification code from the background of the picture; and a recognition submodule, configured to recognize the verification code in the picture according to the multiple pieces of feature information.
The embodiments of the present invention described above are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.
In several embodiments provided herein, it should be understood that disclosed technology contents can pass through others
Mode is realized.Wherein, the apparatus embodiments described above are merely exemplary, for example, the unit division, Ke Yiwei
A kind of division of logic function, formula that in actual implementation, there may be another division manner, such as multiple units or component can combine or
Person is desirably integrated into another system, or some features can be ignored or not executed.Another point, shown or discussed is mutual
Between coupling, direct-coupling or communication connection can be INDIRECT COUPLING or communication link by some interfaces, unit or module
It connects, can be electrical or other forms.
The unit illustrated as separating component may or may not be physically separated, aobvious as unit
The component shown may or may not be physical unit, you can be located at a place, or may be distributed over multiple
On unit.Some or all of unit therein can be selected according to the actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present invention can be integrated in a processing unit, it can also
It is that each unit physically exists alone, it can also be during two or more units be integrated in one unit.Above-mentioned integrated list
The form that hardware had both may be used in member is realized, can also be realized in the form of SFU software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above is only a preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A web page crawling method, characterized by comprising:
intercepting, at a first website, an access request for accessing the first website, wherein the source of the access request is a second website;
obtaining information of the second website from the uniform resource locator of the source, wherein the information of the second website includes account information, in the second website, of a user to be crawled;
crawling web pages from the second website according to the information of the second website.
2. The method according to claim 1, characterized in that crawling web pages from the second website according to the information comprises:
logging in to the second website using a predetermined account through a crawl thread among multiple threads, wherein the multiple threads include a plurality of the crawl threads, and each crawl thread corresponds to one predetermined account;
after logging in to the second website using the predetermined account, crawling web pages from the second website using the account of the second website.
3. The method according to claim 2, characterized in that crawling web pages from the second website using the account of the second website comprises:
obtaining preconfigured restriction information;
controlling the crawl thread to crawl web pages from the second website at the access speed specified in the restriction information.
4. The method according to claim 2, characterized in that logging in to the second website using a predetermined account through a crawl thread among multiple threads further comprises:
binding a fixed network address to each thread among the multiple threads.
5. The method according to claim 2, characterized in that, when logging in to the second website requires a verification code, logging in to the second website using the predetermined account comprises at least one of the following:
entering the verification code in a preset manner and logging in to the second website using the predetermined account;
obtaining a verification code presented in the form of a picture, recognizing the verification code in the picture, and logging in to the second website using the predetermined account according to the recognized verification code.
6. The method according to claim 5, characterized in that recognizing the verification code in the picture comprises:
recognizing the verification code in the picture according to a data model, wherein the data model is trained from multiple pieces of training data, and the training data includes verification code pictures of the second website obtained in advance and the verification codes corresponding to those pictures.
7. The method according to claim 5, characterized in that recognizing the verification code in the picture comprises:
obtaining multiple pieces of feature information from the picture, wherein the feature information is used to distinguish the verification code from the background of the picture;
recognizing the verification code in the picture according to the multiple pieces of feature information.
8. A web page crawling device, characterized by comprising:
an interception unit, configured to intercept, at a first website, an access request for accessing the first website, wherein the source of the access request is a second website;
an acquiring unit, configured to obtain information of the second website from the uniform resource locator of the source, wherein the information of the second website includes account information, in the second website, of a user to be crawled;
a crawling unit, configured to crawl web pages from the second website according to the information of the second website.
9. The device according to claim 8, characterized in that the crawling unit comprises:
a login module, configured to log in to the second website using a predetermined account through a crawl thread among multiple threads, wherein the multiple threads include a plurality of the crawl threads, and each crawl thread corresponds to one predetermined account;
a crawling module, configured to crawl web pages from the second website using the account of the second website after logging in to the second website using the predetermined account.
10. The device according to claim 9, characterized in that the crawling module comprises:
a first acquisition module, configured to obtain preconfigured restriction information;
a control module, configured to control the crawl thread to crawl web pages from the second website at the access speed specified in the restriction information.
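As an illustration of claim 1, the step of obtaining account information from the uniform resource locator of the request source might look like the following minimal sketch. The claims do not say where in the URL the account appears, so the `uid` query parameter, the path-segment fallback, the function name `account_from_referrer`, and the `social.example` host are all assumptions made for this example.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def account_from_referrer(referrer: str, param: str = "uid") -> Optional[dict]:
    """Return the source site and account found in a referrer URL, if any."""
    parts = urlparse(referrer)
    if not parts.netloc:
        return None  # not an absolute URL, so no source website can be identified
    # Assumed convention: the account is carried in a query parameter.
    values = parse_qs(parts.query).get(param)
    if values:
        return {"site": parts.netloc, "account": values[0]}
    # Fallback assumption: the last path segment names the account.
    segments = [segment for segment in parts.path.split("/") if segment]
    if segments:
        return {"site": parts.netloc, "account": segments[-1]}
    return None


# Hypothetical referrer taken from an intercepted access request.
info = account_from_referrer("https://social.example/u/alice?from=share")
print(info)
```

In a real deployment the parsing rule would have to be configured per source website, since each site encodes user identifiers differently.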
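Claims 2 and 4 together describe one crawl thread per predetermined account, with a fixed network address bound to each thread. The sketch below shows one way that pairing could be organized. The names `CrawlSlot`, `build_slots`, `run_crawl_threads`, and `fake_crawl` are hypothetical, the addresses are placeholders, and the login and fetch steps are replaced by a stub that performs no network access.

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class CrawlSlot:
    account: str   # predetermined login account used by this thread
    address: str   # fixed network address (for example a proxy) bound to this thread


def build_slots(accounts, addresses):
    """Pair each predetermined account with one fixed address, one slot per thread."""
    if len(addresses) < len(accounts):
        raise ValueError("need at least one address per account")
    return [CrawlSlot(account, address) for account, address in zip(accounts, addresses)]


def run_crawl_threads(slots, targets, crawl):
    """Start one thread per slot; each thread handles its share of the targets."""
    results = {}
    lock = threading.Lock()

    def worker(index, slot):
        # Round-robin split: thread i takes targets i, i + n, i + 2n, ...
        for target in targets[index::len(slots)]:
            page = crawl(slot, target)
            with lock:
                results[target] = page

    threads = [threading.Thread(target=worker, args=(i, slot)) for i, slot in enumerate(slots)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


def fake_crawl(slot, target):
    # Stub standing in for "log in as slot.account via slot.address, then fetch target".
    return f"{target} fetched as {slot.account} via {slot.address}"


slots = build_slots(["crawler_a", "crawler_b"], ["10.0.0.1", "10.0.0.2"])
pages = run_crawl_threads(slots, ["alice", "bob", "carol"], fake_crawl)
```

Keeping the account and the address in one immutable slot means a thread can never mix one account with another thread's address.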
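The access-speed control in claims 3 and 10 can be pictured as a per-thread rate limiter driven by the preconfigured restriction information. The following is a sketch under assumptions: the claims do not define the format of the restriction information, so the `max_requests_per_second` key and the `RateLimiter` class are invented for illustration.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between successive requests of one crawl thread."""

    def __init__(self, max_requests_per_second, clock=time.monotonic, sleep=time.sleep):
        if max_requests_per_second <= 0:
            raise ValueError("access speed must be positive")
        self.min_interval = 1.0 / max_requests_per_second
        self._clock = clock
        self._sleep = sleep
        self._next_allowed = None  # earliest time at which the next request may start

    def wait(self):
        """Block until the next request is allowed, then reserve the following slot."""
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self._sleep(self._next_allowed - now)
            now = self._next_allowed
        self._next_allowed = now + self.min_interval


# Hypothetical preconfigured restriction information.
restriction = {"max_requests_per_second": 2}
limiter = RateLimiter(restriction["max_requests_per_second"])
for target in ["alice", "bob"]:
    limiter.wait()  # throttle before each page request
    # The page request for `target` would be issued here.
```

The clock and sleep functions are injectable so the pacing logic can be checked without real delays.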
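Claim 7 refers to feature information that distinguishes the verification code from the picture background without naming the features. One simple candidate, assumed here purely for illustration, is pixel brightness: thresholding a grayscale image yields a foreground mask, and a per-column count of foreground pixels indicates where characters are. The functions `binarize` and `column_profile` and the tiny sample image are hypothetical.

```python
def binarize(pixels, threshold=None):
    """Mark each pixel as foreground (1) or background (0) by brightness.

    `pixels` is a list of rows of grayscale values (0 = black, 255 = white).
    Assumption: characters are darker than the background.
    """
    flat = [value for row in pixels for value in row]
    if threshold is None:
        threshold = sum(flat) / len(flat)  # mean brightness as a simple cut-off
    return [[1 if value < threshold else 0 for value in row] for row in pixels]


def column_profile(mask):
    """Count foreground pixels per column; empty columns hint at gaps between characters."""
    return [sum(column) for column in zip(*mask)]


# Toy 3x5 grayscale image: dark strokes in columns 1 and 4 on a light background.
image = [
    [250, 30, 240, 245, 20],
    [248, 25, 242, 250, 35],
    [251, 40, 239, 247, 28],
]
mask = binarize(image)
profile = column_profile(mask)
```

Real pictures usually need noise removal and more robust features than a single global threshold; this only shows the foreground/background separation idea.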
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710085587.4A CN108446287A (en) | 2017-02-16 | 2017-02-16 | Web page crawl method and device |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710085587.4A CN108446287A (en) | 2017-02-16 | 2017-02-16 | Web page crawl method and device |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN108446287A (en) | 2018-08-24 |
Family
ID=63190769
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201710085587.4A Pending CN108446287A (en) | 2017-02-16 | 2017-02-16 | Web page crawl method and device |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN108446287A (en) |
Cited By (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112836108A (en) * | 2021-01-29 | 2021-05-25 | 宝宝巴士股份有限公司 | Method and terminal for crawling third-party website data |
2017-02-16: Application CN201710085587.4A filed in CN (published as CN108446287A); status: Pending
Patent Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20070208744A1 (en) * | 2006-03-01 | 2007-09-06 | Oracle International Corporation | Flexible Authentication Framework |
| US20150339682A1 (en) * | 2012-07-25 | 2015-11-26 | Indix Corporation | Adaptive gathering of structured and unstructured data system and method |
| CN103533097A (en) * | 2013-10-10 | 2014-01-22 | 北京京东尚科信息技术有限公司 | Web crawler downloading and analyzing method and device |
| US20160277429A1 (en) * | 2014-03-28 | 2016-09-22 | Amazon Technologies, Inc. | Token based automated agent detection |
| CN106033579A (en) * | 2015-03-16 | 2016-10-19 | 北京国双科技有限公司 | Data processing method and apparatus thereof |
| CN105260447A (en) * | 2015-10-09 | 2016-01-20 | 上海瀚之友信息技术服务有限公司 | Webpage data analysis method and system |
| CN105279272A (en) * | 2015-10-30 | 2016-01-27 | 南京未来网络产业创新有限公司 | Content aggregation method based on distributed web crawlers |
| CN105956175A (en) * | 2016-05-24 | 2016-09-21 | 考拉征信服务有限公司 | Webpage content crawling method and device |
Non-Patent Citations (4)
| Title |
|---|
| 扶宇琳: "WeiboInfo_一个基于时间轴的微博可视化及总结原型系统", 《中国优秀硕士学位论文全文数据库信息科技辑》 * |
| 李冠辰等: "一个基于 hadoop 的并行社交网络挖掘系统", 《SOFTWARE》 * |
| 杨济运等: "基于协程模型的分布式爬虫框架", 《计算技术与自动化》 * |
| 陈利婷: "大数据时代的反爬虫技术", 《电脑与信息技术》 * |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing; Applicant after: Beijing Guoshuang Technology Co.,Ltd.; Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing; Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180824 |