CN110020041B - Method and device for tracking crawling process - Google Patents
- Publication number
- CN110020041B (application number CN201710719691.4A)
- Authority
- CN
- China
- Prior art keywords
- website
- crawling
- task
- crawled
- website address
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a method for tracking a crawling process, comprising: storing the website address of a website to be crawled, which has been sent to the crawler module, in a crawling website list; polling the website addresses in the crawling website list and, each time a website address is polled, performing the following operations: querying the crawling website list, according to the polled website address, to obtain the crawling task ID for crawling that address; and querying a database, according to the crawling task ID, to obtain the error information corresponding to that ID, where the database stores the crawling results of the crawler module and the error information generated during crawling. With this method, the crawling process can be tracked automatically and the error information generated during crawling can be queried.
Description
Technical Field
The invention relates to the technical field of the internet, and in particular to a method and a device for tracking a crawling process.
Background
A web crawler is a program or script that automatically fetches information from the internet according to given rules. While a web crawler is crawling a website, its crawling process needs to be tracked continuously so that problems can be found promptly.
A web crawler consists of a number of sub-modules with different logical functions. When a sub-module encounters an error during crawling, the error information is automatically sent to a database. Technicians can obtain this error information by querying the database manually, and in that way track the crawling process. However, because a web crawler has many sub-modules, crawls quickly, and runs for a long time, repeatedly querying the database for the error information of every sub-module is tedious and labor-intensive for a technician, and manual querying is inefficient.
Disclosure of Invention
In view of the above problems, the present invention provides a method of tracking a crawling process that overcomes, or at least partially solves, those problems. With this method, the crawling process of a crawler can be tracked automatically and the error information of the crawling process can be obtained.
A first aspect of the invention provides a method for tracking a crawling process, comprising: storing the website address of a website to be crawled, which has been sent to the crawler module, in a crawling website list; polling the website addresses in the crawling website list and, each time a website address is polled, performing the following operations: querying the crawling website list, according to the polled website address, to obtain the crawling task ID for crawling that address; and querying a database, according to the crawling task ID, to obtain the error information corresponding to that ID, where the database stores the crawling results of the crawler module and the error information generated during crawling. In this scheme a program queries the database for the error information corresponding to the crawling task ID, so that this error information is queried automatically and the crawling process is tracked automatically.
A second aspect of the invention provides an apparatus for tracking a crawling process, comprising: a data storage unit, configured to store the website address of a website to be crawled, which has been sent to the crawler module, in a crawling website list; a polling processing unit, configured to poll the website addresses in the crawling website list; a first query unit, configured to query the crawling website list, according to the polled website address, to obtain the crawling task ID for crawling that address; and a second query unit, configured to query a database, according to the crawling task ID, to obtain the error information corresponding to that ID, where the database stores the crawling results of the crawler module and the error information generated during crawling.
In one implementation, before storing the website address of the website to be crawled, which has been sent to the crawler module, in the crawling website list, the method further includes: sending the website address of the website to be crawled to the crawler module, so that the crawler module crawls information from that website address.
In one implementation, before querying the crawling website list for the crawling task ID of the polled website address, the method further includes: judging whether the time interval between the time the website address was sent to the crawler module and the time the website address is polled has reached a set duration; and, if the interval has reached the set duration, querying the crawling website list, according to the polled website address, to obtain the crawling task ID for crawling that address.
In one implementation, the method further comprises: querying the database, according to the crawling task ID, to obtain the crawling amount of the polled website address; and comparing the crawling amount of the polled website address with a set crawling amount to judge whether the crawling amount is abnormal.
In one implementation, the method further comprises: querying an experience database, according to the error information corresponding to the crawling task ID, for the error cause and the suggested modification scheme corresponding to each piece of error information; and sending the error information corresponding to the crawling task ID, together with the error cause and suggested modification scheme for each piece of error information, to a designated responsible person.
With the above technical solution, the method for tracking a crawling process uses a program to query the database for the error information corresponding to the crawling task ID, so that this error information is queried automatically and the crawling process is tracked automatically.
A third aspect of the present invention provides a storage medium storing a program that, when executed, implements the above method of tracking a crawling process and its various implementations.
A fourth aspect of the present invention provides a processor for running a program, where the program, when run, performs the above method of tracking a crawling process and its various implementations.
The foregoing is only an overview of the technical solution of the present invention. Specific embodiments are described below so that the technical means of the invention can be understood more clearly, and so that the above and other objects, features, and advantages of the invention become more apparent.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 illustrates a flow diagram of a method of tracking a crawling process provided by an embodiment of the present invention;
FIG. 2 illustrates a flow diagram of another method for tracking a crawling process provided by an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an apparatus for tracking a crawling process according to an embodiment of the present invention.
Detailed Description
The technical solution of the embodiments of the invention applies to scenarios in which the crawling process of a crawler is tracked automatically.
With this technical solution, the crawling process of a crawler can be tracked automatically, the error information generated during crawling can be queried, the causes of the errors can be analyzed, and modification suggestions can be given.
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
An embodiment of the invention discloses a method for tracking a crawling process which, as shown in FIG. 1, comprises the following steps:
S101, storing the website address of the website to be crawled, which has been sent to the crawler module, in a crawling website list;
Specifically, the website address of the website to be crawled is stored in the crawling website list together with the time at which that address was sent to the crawler module.
When the website address of the website to be crawled is sent to the crawler module, the embodiment also generates a crawling task ID for crawling that address and configures the parameters of the crawling task identified by that ID. The crawling task ID corresponding to the website address and the crawling task parameters are likewise stored in the crawling website list.
S102, polling the website addresses in the crawling website list;
Specifically, the embodiment polls the not-yet-polled website addresses in the crawling website list in order from first to last; after the last website address has been polled, it returns to the head of the crawling website list and starts polling the addresses in the list again.
Each time a website address is polled, the following operations are performed:
S103, querying the crawling website list, according to the polled website address, for the crawling task ID corresponding to that address;
Specifically, the website address of each website to be crawled corresponds to exactly one crawling task ID. When a website address is polled, the crawling task ID corresponding to it can be looked up in the crawling website list using the website address.
S104, querying, according to the crawling task ID, the pre-stored error information corresponding to that ID.
Specifically, after the website address of the website to be crawled is sent to the crawler module, the crawler module crawls the required information from the website. During crawling, when a sub-module of the crawler module encounters an error, the crawler module stores the error information in association with the ID of the crawling task being executed. The crawler module stores the error information in a data table or a database; storage proceeds alongside the crawling process and is updated in real time.
All events generated while crawling a website correspond to the crawling task ID of that website. Therefore, all error information corresponding to a crawling task ID can be obtained by querying with the crawling task ID of the website.
Thus, in the technical solution of this embodiment, a crawling website list is maintained and a program automatically queries the error information of each crawling task, so that the crawling process of the crawler is tracked automatically.
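The contents of one entry in the crawling website list can be sketched as a small record. This is a minimal illustration only: the class name `CrawlEntry`, its field names, the helper `register_site`, and the use of a random UUID as the task ID are all assumptions, since the source does not specify how the list or the ID is implemented.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List
import uuid


@dataclass
class CrawlEntry:
    """One row of the crawling website list (field names are hypothetical)."""
    url: str
    sent_at: datetime            # time the address was sent to the crawler module
    task_id: str                 # crawling task ID, unique per website address
    task_params: Dict[str, str] = field(default_factory=dict)


def register_site(crawl_list: List[CrawlEntry], url: str, sent_at: datetime,
                  task_params: Dict[str, str]) -> CrawlEntry:
    """Append a URL with its send time, a fresh task ID, and its task parameters."""
    entry = CrawlEntry(url=url, sent_at=sent_at, task_id=uuid.uuid4().hex,
                       task_params=dict(task_params))
    crawl_list.append(entry)
    return entry


crawl_list: List[CrawlEntry] = []
first = register_site(crawl_list, "https://example.com",
                      datetime(2017, 8, 21, 9, 0), {"depth": "2"})
second = register_site(crawl_list, "https://example.org",
                       datetime(2017, 8, 21, 9, 1), {"depth": "1"})
```

Appending in send order keeps the list in the same order in which addresses were handed to the crawler module.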
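Steps S102 to S104 can be sketched as a loop over the list that wraps back to the head after the last entry. The in-memory `crawl_list` and `error_db` below are hypothetical stand-ins for the crawling website list and the database, and the `rounds` limit exists only so the sketch terminates.

```python
from itertools import cycle, islice
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-ins for the crawling website list and the error database.
crawl_list = [
    {"url": "https://example.com", "task_id": "task-1"},
    {"url": "https://example.org", "task_id": "task-2"},
]
error_db: Dict[str, List[str]] = {
    "task-1": ["parser: unexpected page layout"],
    "task-2": [],
}


def task_id_for(url: str) -> str:
    """S103: look up the crawling task ID by website address."""
    for row in crawl_list:
        if row["url"] == url:
            return row["task_id"]
    raise KeyError(url)


def errors_for(task_id: str) -> List[str]:
    """S104: fetch the stored error information for a task ID."""
    return error_db.get(task_id, [])


def poll(rounds: int) -> List[Tuple[str, str, List[str]]]:
    """S102: walk the list front to back, wrapping to the head after the last entry."""
    seen = []
    for row in islice(cycle(crawl_list), rounds):
        tid = task_id_for(row["url"])
        seen.append((row["url"], tid, errors_for(tid)))
    return seen


report = poll(3)  # two entries, so the third poll wraps around to the first
```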
Fig. 2 shows a more detailed implementation of the technical solution of the above embodiment. Referring to fig. 2, the method for tracking a crawling process specifically includes:
S201, sending the website address of the website to be crawled and the information to be crawled to the crawler module;
Specifically, the crawler module consists of a number of sub-modules with different logical functions; together these sub-modules implement the function of the crawler module, so that it can crawl the required information from a website.
Crawling of a website by the crawler module starts with seed injection, that is, sending the URL of the website to be crawled to the crawler module as a parameter. After confirming the information to be crawled, the crawler module starts crawling that information from the website.
S202, after the website address of the website to be crawled has been successfully sent to the crawler module, storing the website address and the time at which it was sent to the crawler module in a crawling website list;
Specifically, the crawling website list stores the website address of each website to be crawled, the time at which that address was sent to the crawler module, the crawling task ID of the website, and the parameter information of each crawling task ID.
The crawling task ID is identification information for the task that crawls a website. When a website is to be crawled, a crawling task for that website is created, and the task has unique identification information, namely the crawling task ID. All events generated while crawling the website correspond to the crawling task ID of that website. It should be noted that the point in time at which the crawling task ID is created can be arranged flexibly according to the actual situation, and can be any feasible point while the website is being crawled. Furthermore, storing the created crawling task ID in the crawling website list is the preferred way of storing it in this embodiment; in a specific application of the technical solution, the crawling task ID may be stored in any other way that suits the actual situation, as long as it can be queried effectively.
The parameter information of a crawling task ID is the parameter configuration set for the crawling task identified by that ID.
When the website address of the website to be crawled has been successfully sent to the crawler module, the website address and the time at which it was sent are stored in the crawling website list. Website addresses are stored in the same order in which they were sent to the crawler module.
If an error occurs while sending the website address of the website to be crawled, so that the address is not successfully delivered to the crawler module, the address is automatically sent again until it has been successfully delivered.
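The send-and-retry behavior of S201 and S202 can be sketched as follows. The `send` callable is a hypothetical stand-in for whatever interface the crawler module exposes, and `max_attempts` is an added safety cap that the source does not mention (it describes retrying until success).

```python
from datetime import datetime
from typing import Callable, Dict, List


def send_until_success(url: str, send: Callable[[str], bool],
                       crawl_list: List[Dict[str, object]],
                       max_attempts: int = 5) -> int:
    """Resend the URL until `send` reports success, then record it in the list.

    Returns the number of attempts used. `max_attempts` is an assumption added
    so the sketch cannot loop forever.
    """
    for attempt in range(1, max_attempts + 1):
        if send(url):
            # Only a successfully delivered address is stored, with its send time.
            crawl_list.append({"url": url, "sent_at": datetime.now()})
            return attempt
    raise RuntimeError(f"could not deliver {url} after {max_attempts} attempts")


# Hypothetical crawler endpoint that fails twice before accepting the seed.
outcomes = iter([False, False, True])
crawl_list: List[Dict[str, object]] = []
attempts = send_until_success("https://example.com", lambda u: next(outcomes), crawl_list)
```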
S203, the crawler module crawls the information to be crawled from the received website address and, whenever a sub-module encounters an error during crawling, stores the error information of that sub-module;
Specifically, the error information of a sub-module includes the name of the sub-module, the ID of the crawling task in which the error occurred, and the specific error message. The embodiment uses a data storage medium such as a database or a data table to store the error information of the sub-modules.
The database or data table can be one dedicated to storing the crawling results and the error information generated during crawling. The embodiment is implemented with an Elasticsearch database.
While the crawler module is crawling information from the website, if any sub-module encounters an error, the crawler module sends the error information of that sub-module to the database, once for every error that occurs. Similarly, the crawler module continuously stores the crawling results in the database. That is, the error information and the crawling results are stored continuously alongside the crawling process and updated in real time. This recording and storing can also be done by means of system logs.
It should be noted that the way the crawler module stores error information is the usual prior-art way of recording errors during crawling. The embodiment does not substantially change this process; it simply continues to use it to record and store the error information generated during crawling for later use.
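The shape of a stored error record (sub-module name, crawling task ID, specific error message) can be sketched as below. The `ErrorStore` class is a hypothetical in-memory stand-in; the embodiment itself uses an Elasticsearch database, whose client calls are deliberately not shown here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ErrorRecord:
    """One error event, with the three fields named in the description."""
    submodule: str
    task_id: str
    message: str


class ErrorStore:
    """In-memory stand-in for the database that receives error information."""

    def __init__(self) -> None:
        self._rows: List[ErrorRecord] = []

    def record(self, rec: ErrorRecord) -> None:
        # Called once per error, as crawling proceeds.
        self._rows.append(rec)

    def by_task(self, task_id: str) -> List[ErrorRecord]:
        # Every event carries the task ID, so one filter returns all errors of a task.
        return [r for r in self._rows if r.task_id == task_id]


store = ErrorStore()
store.record(ErrorRecord("downloader", "task-1", "connection timed out"))
store.record(ErrorRecord("parser", "task-1", "unexpected page layout"))
store.record(ErrorRecord("downloader", "task-2", "HTTP 403"))
task1_errors = store.by_task("task-1")
```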
S204, polling the website addresses in the crawling website list;
Specifically, the embodiment polls the not-yet-polled website addresses in the crawling website list in order from first to last; after the last website address has been polled, it returns to the head of the crawling website list and starts polling the addresses in the list again.
When a website address is polled, step S205 is executed: judging whether the time interval between the time the website address was sent to the crawler module and the time the website address is polled has reached a set duration;
Specifically, the set duration is the time the embodiment reserves for the crawler module to crawl information from the website. Error information generally appears only after the crawler module has been crawling a website for some time, so after the website address is sent to the crawler module, a period of time is reserved for the crawler module to crawl information from the website.
The specific value of the set duration is chosen flexibly according to the actual situation, taking into account the crawling efficiency and the error rate of the crawler module.
In this embodiment, the set duration is 2 minutes. When a website address is polled, if the interval between the time the address was sent to the crawler module and the time it is polled has reached 2 minutes, the subsequent processing steps are executed; if the interval is less than 2 minutes, polling of the crawling website list continues, and the subsequent processing steps are executed only once a polled address is found whose interval has reached 2 minutes.
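The check in step S205 reduces to a comparison of two timestamps. The sketch below assumes the send time is available as a `datetime`; the function and constant names are illustrative, and only the 2-minute value comes from the described embodiment.

```python
from datetime import datetime, timedelta

SET_DURATION = timedelta(minutes=2)  # the value used in the described embodiment


def ready_to_query(sent_at: datetime, polled_at: datetime,
                   set_duration: timedelta = SET_DURATION) -> bool:
    """S205: True once the interval since sending has reached the set duration."""
    return polled_at - sent_at >= set_duration


sent = datetime(2017, 8, 21, 9, 0, 0)
too_early = ready_to_query(sent, sent + timedelta(seconds=90))
on_time = ready_to_query(sent, sent + timedelta(minutes=2))
```

An address for which the check returns `False` is simply skipped in this round and examined again the next time polling reaches it.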
If the time interval between the time the website address was sent to the crawler module and the time the website address is polled has reached the set duration, step S206 is executed: querying the crawling website list, according to the website address of the website, for the crawling task ID for crawling that website;
Specifically, the crawling task ID is identification information for the task that crawls the website. Once the embodiment has confirmed that a website needs to be crawled, it sends the website address of that website to the crawler module, stores the website address in the crawling website list, creates a crawling task ID for the event of crawling the website, and stores the crawling task ID in the crawling website list. In step S206, the embodiment obtains the crawling task ID for crawling the website from the crawling website list.
S207, querying, according to the crawling task ID of the website, the pre-stored error information corresponding to that ID;
Specifically, the error information corresponding to the crawling task ID includes, for the crawling task identified by that ID, the names of the sub-modules in which errors occurred, the error messages, and the number of errors that occurred in each of those sub-modules.
As described for step S203, while the crawler module is crawling information from the website, it sends the error information of any sub-module that encounters an error to the database. The database therefore holds error information for every error. All events generated while crawling the website correspond to the crawling task ID of that website, so all error information corresponding to a crawling task ID can be obtained by querying with the crawling task ID of the website.
Specifically, in this embodiment, all error information associated with the crawling task ID is queried from the database by means of an Elasticsearch aggregation query.
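An aggregation query of this kind might look like the request body below, which filters by task ID and counts error documents per sub-module with a terms aggregation. This is only a sketch: the field names `task_id` and `submodule` and the aggregation name `by_submodule` are assumptions (the source does not give the index mapping), both fields are assumed to be mapped as `keyword`, and no client call is shown.

```python
from typing import Any, Dict


def build_error_query(task_id: str) -> Dict[str, Any]:
    """Request body: filter by task ID, count error documents per sub-module."""
    return {
        "size": 0,  # only the aggregation buckets are needed, not the hits
        "query": {"term": {"task_id": task_id}},
        "aggs": {"by_submodule": {"terms": {"field": "submodule"}}},
    }


def errors_per_submodule(response: Dict[str, Any]) -> Dict[str, int]:
    """Read the bucket keys and document counts out of a terms-aggregation response."""
    buckets = response["aggregations"]["by_submodule"]["buckets"]
    return {b["key"]: b["doc_count"] for b in buckets}


# Hand-written sample in the shape of a terms-aggregation response (not real data).
sample_response = {
    "aggregations": {"by_submodule": {"buckets": [
        {"key": "downloader", "doc_count": 3},
        {"key": "parser", "doc_count": 1},
    ]}}
}
body = build_error_query("task-1")
counts = errors_per_submodule(sample_response)
```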
S208, querying, according to the crawling task ID for crawling the website, the pre-stored crawling amount for that website;
Specifically, the crawling amount for a website is the amount of information the crawler module has crawled from the website while executing the crawling task. As described for step S203, in this embodiment the crawler module stores the crawling results in real time during crawling, for example in the Elasticsearch database.
The information crawled by the crawler module while crawling a website is stored in the Elasticsearch database. The embodiment queries by crawling task ID and aggregates to obtain the total amount of information crawled by the crawler module while executing the crawling task identified by that ID.
S209, comparing the crawling amount of the website with the set crawling amount, and judging whether the two are the same;
Specifically, the set crawling amount is the normal amount expected from crawling the website, and is set when the crawling task is initialized.
If the crawling amount of the website differs from the set crawling amount, step S210 is executed: confirming that the crawling amount of the website is abnormal;
Specifically, an abnormal crawling amount is also regarded as an error that occurred while crawling the website.
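Steps S208 to S210 can be sketched as a sum followed by an equality check. The per-batch counts below are invented sample values, and representing the crawling amount as an item count is an assumption; the source only says the total is obtained by aggregation.

```python
from typing import Iterable


def total_crawl_amount(batch_sizes: Iterable[int]) -> int:
    """S208: aggregate the stored results of one task into a total amount."""
    return sum(batch_sizes)


def crawl_amount_abnormal(crawled: int, set_amount: int) -> bool:
    """S209/S210: the amount is abnormal whenever it differs from the set amount."""
    return crawled != set_amount


# Invented sample: three stored batches for one task, with 100 items expected.
crawled = total_crawl_amount([40, 35, 20])
abnormal = crawl_amount_abnormal(crawled, set_amount=100)
```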
S211, querying, according to the error information corresponding to the crawling task ID, the pre-stored error cause and suggested modification scheme corresponding to each piece of error information;
Specifically, the empirical error causes and empirical suggested modification schemes are those summarized by technicians from past crawling-task errors. When technicians troubleshoot a crawling error, they record the cause of the error and the modification scheme for later use. The error causes and suggested modification schemes can be stored in a database or a data table, giving an experience database or experience data table.
The experience database or experience data table stores the empirical error causes of crawling-task errors and the empirical suggested modification schemes. In this embodiment, when the error information of a crawling task is obtained, it is matched against the error information identifiers in the experience database or experience data table to find the error cause and suggested modification scheme for each type of crawling-task error information.
S212, sending the error information corresponding to the crawling task ID, together with the error cause and suggested modification scheme corresponding to each piece of error information, to a designated responsible person.
Specifically, the embodiment sends the error information corresponding to the crawling task ID, together with the error cause and suggested modification scheme for each piece of error information, by e-mail to the mailbox of the designated responsible person.
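Matching error information against the experience database can be sketched as a keyed lookup. The identifiers, causes, and suggestions in `EXPERIENCE_DB` are invented for illustration, and the source does not say how error information identifiers are formed.

```python
from typing import Dict, List, NamedTuple, Optional


class Advice(NamedTuple):
    cause: str
    suggestion: str


# Hypothetical experience table keyed by an error information identifier.
EXPERIENCE_DB: Dict[str, Advice] = {
    "HTTP_403": Advice("target site rejected the request",
                       "lower the request rate or change the proxy"),
    "PARSE_FAIL": Advice("page layout changed",
                         "update the extraction rules for this site"),
}


def advise(error_ids: List[str]) -> Dict[str, Optional[Advice]]:
    """S211: find the recorded cause and suggestion for each error identifier.

    Unknown identifiers map to None so they can be flagged for manual review.
    """
    return {error_id: EXPERIENCE_DB.get(error_id) for error_id in error_ids}


findings = advise(["HTTP_403", "DNS_FAIL"])
```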
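Composing the notification of S212 can be sketched with the standard-library `email.message.EmailMessage` class. The sender address, subject wording, and body layout are assumptions, and actual delivery (for example through an SMTP server) is left out because the source does not describe it.

```python
from email.message import EmailMessage
from typing import List, Tuple


def build_report(task_id: str, recipient: str,
                 findings: List[Tuple[str, str, str]]) -> EmailMessage:
    """S212: one message listing each error with its cause and suggested fix."""
    msg = EmailMessage()
    msg["Subject"] = f"Crawl errors for task {task_id}"
    msg["From"] = "crawl-tracker@example.com"  # hypothetical sender address
    msg["To"] = recipient
    sections = [
        f"error: {error}\ncause: {cause}\nsuggestion: {fix}"
        for error, cause, fix in findings
    ]
    msg.set_content("\n\n".join(sections))
    return msg


report_mail = build_report(
    "task-1",
    "owner@example.com",
    [("HTTP 403 from downloader", "target site rejected the request",
      "lower the request rate")],
)
```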
FIG. 3 is a schematic structural diagram of an apparatus for tracking a crawling process according to an embodiment of the present invention. The apparatus comprises: a data storage unit 301, configured to store the website address of a website to be crawled, which has been sent to the crawler module, in a crawling website list; a polling processing unit 302, configured to poll the website addresses in the crawling website list; a first query unit 303, configured to query the crawling website list, according to the polled website address, to obtain the crawling task ID for crawling that address; and a second query unit 304, configured to query a database, according to the crawling task ID, to obtain the error information corresponding to that ID, where the database stores the crawling results of the crawler module and the error information generated during crawling.
For the specific operation of each unit in this embodiment, refer to the method embodiments above; it is not repeated here.
The apparatus for tracking a crawling process comprises a processor and a memory. The data storage unit, the polling processing unit, the first query unit, the second query unit, and so on are stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program unit from the memory. One or more kernels may be provided, and the crawling process is tracked by adjusting kernel parameters.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
An embodiment of the present invention provides a storage medium storing a program that, when executed by a processor, implements the method of tracking a crawling process.
An embodiment of the invention provides a processor for running a program, where the program, when run, performs the method of tracking a crawling process.
An embodiment of the invention provides a device comprising a processor, a memory, and a program stored in the memory and runnable on the processor, where the processor, when executing the program, implements the following steps: storing the website address of a website to be crawled, which has been sent to the crawler module, in a crawling website list; polling the website addresses in the crawling website list and, each time a website address is polled, performing the following operations: querying the crawling website list, according to the polled website address, to obtain the crawling task ID for crawling that address; and querying a database, according to the crawling task ID, to obtain the error information corresponding to that ID, where the database stores the crawling results of the crawler module and the error information generated during crawling.
In one implementation, before storing the website address of the website to be crawled, which is sent to the crawler module, in the crawled website list, the method further includes: and sending the website address of the website to be crawled to a crawler module, so that the crawler module crawls information from the website address of the website to be crawled.
In one implementation, before querying a crawling task ID for crawling the polled website address from the crawling website list according to the polled website address, the method further includes: judging whether a time interval between the time of sending the website address to the crawler module and the time of polling the website address reaches a set time length or not; and if the time interval between the time of sending the website address to the crawler module and the time of polling the website address reaches the set time length, inquiring to obtain a crawling task ID of the polled website address from the crawling website list according to the polled website address.
In one implementation, the method further comprises: according to the crawling task ID, the crawling amount of the website address which is crawled to is obtained by inquiring in a database; and comparing the crawling quantity of the polled website address with a set crawling quantity, and judging whether the crawling quantity of the polled website address is abnormal or not.
In one implementation, the method further includes: querying an experience database, according to the error information corresponding to the crawling task ID, for the error cause and the suggested modification scheme corresponding to each piece of error information; and sending the error information corresponding to the crawling task ID, together with the error cause and the suggested modification scheme corresponding to each piece of error information, to a designated responsible person.
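The experience-database lookup and notification above might be sketched as follows. The contents of `EXPERIENCE_DB`, the fallback text for unknown errors, the recipient address, and the `send` callback are all hypothetical; the text does not specify how the experience database is organized or how the responsible person is contacted.

```python
from typing import Callable, Dict, List, NamedTuple, Tuple


class Diagnosis(NamedTuple):
    error: str
    cause: str
    suggestion: str


# Hypothetical experience database: error information -> (cause, suggested fix).
EXPERIENCE_DB: Dict[str, Tuple[str, str]] = {
    "HTTP 403": ("crawler blocked by the site",
                 "rotate the proxy or slow the request rate"),
    "XPath not found": ("page layout changed",
                        "update the extraction rule"),
}


def diagnose(errors: List[str],
             experience_db: Dict[str, Tuple[str, str]]) -> List[Diagnosis]:
    """Look up the cause and suggested fix for each piece of error information."""
    diagnoses = []
    for error in errors:
        # Assumed fallback for errors the experience database does not know.
        cause, suggestion = experience_db.get(
            error, ("unknown", "manual investigation needed"))
        diagnoses.append(Diagnosis(error, cause, suggestion))
    return diagnoses


def notify(recipient: str, diagnoses: List[Diagnosis],
           send: Callable[[str, str], None]) -> None:
    """Send one message per error to the designated responsible person."""
    for d in diagnoses:
        send(recipient, f"{d.error}: cause={d.cause}; suggestion={d.suggestion}")


# Collect messages in a list instead of sending real mail.
outbox: List[Tuple[str, str]] = []
notify("owner@example.com",
       diagnose(["HTTP 403", "disk full"], EXPERIENCE_DB),
       lambda to, body: outbox.append((to, body)))
```

Passing the sender as a callback keeps the sketch independent of any particular channel (mail, instant message, ticketing system).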
The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the following method steps: storing the website address of each website to be crawled, as sent to the crawler module, in a crawling website list; polling the website addresses in the crawling website list and, for each polled website address, performing the following operations: querying the crawling website list, according to the polled website address, to obtain the crawling task ID for crawling that address; and querying a database, according to the crawling task ID, for the error information corresponding to that crawling task ID. The database stores the crawling results of the crawler module and the error information generated during crawling.
In one implementation, before the website address of the website to be crawled, as sent to the crawler module, is stored in the crawling website list, the method further includes: sending the website address of the website to be crawled to the crawler module, so that the crawler module crawls information from that website address.
In one implementation, before the crawling website list is queried, according to the polled website address, for the crawling task ID for crawling that address, the method further includes: determining whether the time interval between the time the website address was sent to the crawler module and the time the website address is polled has reached a set duration; and, if the time interval has reached the set duration, querying the crawling website list, according to the polled website address, to obtain the crawling task ID for crawling that address.
In one implementation, the method further includes: querying the database, according to the crawling task ID, for the crawling amount obtained for the polled website address; and comparing that crawling amount with a set crawling amount to determine whether the crawling amount for the polled website address is abnormal.
In one implementation, the method further includes: querying an experience database, according to the error information corresponding to the crawling task ID, for the error cause and the suggested modification scheme corresponding to each piece of error information; and sending the error information corresponding to the crawling task ID, together with the error cause and the suggested modification scheme corresponding to each piece of error information, to a designated responsible person.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.
Claims (10)
1. A method of tracking a crawling process, comprising:
storing the website address of each website to be crawled, as sent to the crawler module, in a crawling website list; wherein the website address of each website to be crawled uniquely corresponds to a crawling task ID; when a website is crawled, a crawling task for crawling the website is created, the crawling task has unique task identification information, and the task identification information is used as the crawling task ID;
polling the website addresses in the crawling website list and, for each polled website address, performing the following operations:
querying the crawling website list, according to the polled website address, to obtain the crawling task ID corresponding to the polled website address; and querying, according to the crawling task ID, for pre-stored error information corresponding to the crawling task ID; wherein the pre-storing means that, when a sub-module of the crawler module encounters an error during crawling, the error information is stored in association with the crawling task ID being executed.
2. The method of claim 1, wherein before storing the website address of the website to be crawled, as sent to the crawler module, in the crawling website list, the method further comprises:
sending the website address of the website to be crawled to the crawler module, so that the crawler module crawls information from that website address.
3. The method of claim 2, wherein before querying the crawling website list, according to the polled website address, to obtain the crawling task ID corresponding to the polled website address, the method further comprises:
determining whether the time interval between the time the website address was sent to the crawler module and the time the website address is polled has reached a set duration;
and, if the time interval has reached the set duration, querying the crawling website list, according to the polled website address, to obtain the crawling task ID corresponding to the polled website address.
4. The method of claim 1, further comprising:
querying, according to the crawling task ID, for the pre-stored crawling amount obtained by crawling the polled website address;
and comparing the crawling amount for the polled website address with a set crawling amount to determine whether the crawling amount for the polled website address is abnormal.
5. The method of claim 1, further comprising:
querying, according to the error information corresponding to the crawling task ID, for the pre-stored error cause and the pre-stored suggested modification scheme corresponding to each piece of error information;
and sending the error information corresponding to the crawling task ID, together with the error cause and the suggested modification scheme corresponding to each piece of error information, to a designated responsible person.
6. An apparatus for tracking a crawling process, comprising:
a data storage unit, configured to store the website address of each website to be crawled, as sent to the crawler module, in a crawling website list; wherein the website address of each website to be crawled uniquely corresponds to a crawling task ID; when a website is crawled, a crawling task for crawling the website is created, the crawling task has unique task identification information, and the task identification information is used as the crawling task ID;
a polling processing unit, configured to poll the website addresses in the crawling website list;
a first query unit, configured to query the crawling website list, according to the polled website address, to obtain the crawling task ID corresponding to the polled website address;
a second query unit, configured to query, according to the crawling task ID, for pre-stored error information corresponding to the crawling task ID; wherein the pre-storing means that, when a sub-module of the crawler module encounters an error during crawling, the error information is stored in association with the crawling task ID being executed.
7. The apparatus of claim 6, further comprising:
a data sending unit, configured to send the website address of the website to be crawled to the crawler module, so that the crawler module crawls information from that website address.
8. The apparatus of claim 7, further comprising:
a judgment processing unit, configured to determine whether the time interval between the time the website address was sent to the crawler module and the time the website address is polled has reached a set duration;
wherein, if the time interval has reached the set duration, the crawling website list is queried, according to the polled website address, to obtain the crawling task ID corresponding to the polled website address.
9. A storage medium, characterized in that the storage medium has stored thereon a program which, when executed by a processor, implements the method of tracking a crawling process of any of claims 1-5.
10. A processor, characterized in that the processor is configured to run a program, which when run implements the method of tracking a crawling process of any of claims 1-5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710719691.4A CN110020041B (en) | 2017-08-21 | 2017-08-21 | Method and device for tracking crawling process |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710719691.4A CN110020041B (en) | 2017-08-21 | 2017-08-21 | Method and device for tracking crawling process |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110020041A CN110020041A (en) | 2019-07-16 |
CN110020041B CN110020041B (en) | 2021-10-08 |
Family
ID=67186107
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710719691.4A Active CN110020041B (en) | 2017-08-21 | 2017-08-21 | Method and device for tracking crawling process |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110020041B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103856467A (en) * | 2012-12-06 | 2014-06-11 | 百度在线网络技术(北京)有限公司 | Method and distributed system for achieving safety scanning |
CN104063448A (en) * | 2014-06-18 | 2014-09-24 | 华东师范大学 | Distributed type microblog data capturing system related to field of videos |
CN106980691A (en) * | 2017-04-01 | 2017-07-25 | 长沙智擎信息技术有限公司 | A kind of method for auto constructing in on-line teaching resources storehouse |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7680785B2 (en) * | 2005-03-25 | 2010-03-16 | Microsoft Corporation | Systems and methods for inferring uniform resource locator (URL) normalization rules |
US9405831B2 (en) * | 2008-04-16 | 2016-08-02 | Gary Stephen Shuster | Avoiding masked web page content indexing errors for search engines |
CN102902764B (en) * | 2012-09-25 | 2016-05-11 | 北京奇虎科技有限公司 | Method and device for logging |
CN103559219B (en) * | 2013-10-18 | 2016-12-07 | 北京京东尚科信息技术有限公司 | Distributed network crawler capturing method for scheduling task, dispatching terminal equipment and crawl node |
CN104199819B (en) * | 2014-07-03 | 2017-10-17 | 北京思特奇信息技术股份有限公司 | A kind of WEB system mistakes processing method and processing device |
CN104182462B (en) * | 2014-07-21 | 2018-06-26 | 安徽华贞信息科技有限公司 | A kind of web crawlers service system for room library net |
CN106339379B (en) * | 2015-07-07 | 2019-08-16 | 阿里巴巴集团控股有限公司 | Website running state monitoring method and device |
CN106557334B (en) * | 2015-09-25 | 2020-02-07 | 北京国双科技有限公司 | Method and device for judging completion of crawler task |
CN106648839B (en) * | 2015-10-30 | 2020-06-05 | 北京国双科技有限公司 | Data processing method and device |
CN106649362B (en) * | 2015-10-30 | 2020-02-07 | 北京国双科技有限公司 | Webpage crawling method and device |
CN106649455B (en) * | 2016-09-24 | 2021-01-12 | 孙燕群 | Standardized system classification and command set system for big data development |
CN106776744A (en) * | 2016-11-21 | 2017-05-31 | 中国软件与技术服务股份有限公司 | A kind of software development methodology and system based on internet information |
Non-Patent Citations (2)
Title |
---|
An evaluating method of spider detection techniques by trap; Fan Chunlong et al.; 2010 2nd International Conference on Future Computer and Communication; 2010-05-24; V1-823 to V1-826 * |
A distributed crawler framework based on the coroutine model; Yang Jiyun et al.; Computing Technology and Automation; 2014-09-30; Vol. 33, No. 3; 126-133 * |
Also Published As
Publication number | Publication date |
---|---|
CN110020041A (en) | 2019-07-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110971571A (en) | Website domain name verification method and related device | |
CN110489315B (en) | Operation request tracking method, operation request tracking device and server | |
CN110781372B (en) | Method and device for optimizing website, computer equipment and storage medium | |
CN110020339B (en) | Webpage data acquisition method and device based on non-buried point | |
CN104239353B (en) | WEB classification control and log audit method | |
CN103152391A (en) | Journal output method and device | |
CN110858166A (en) | Application exception processing method and device, storage medium and processor | |
CN110099074B (en) | Anomaly detection method and system for Internet of things equipment and electronic equipment | |
CN111222592B (en) | Method and device for acquiring two-dimensional code of equipment | |
CN106648839B (en) | Data processing method and device | |
CN112583944B (en) | Processing method and device for updating domain name certificate | |
CN110020041B (en) | Method and device for tracking crawling process | |
CN108897873B (en) | Method and device for generating job file, storage medium and processor | |
US8161013B2 (en) | Implementing application specific management policies on a content addressed storage device | |
CN109597743B (en) | Page circling method, click rate statistical method and related equipment | |
CN107948234A (en) | The processing method and processing device of data | |
CN111291127B (en) | Data synchronization method, device, server and storage medium | |
CN106611118B (en) | Method and device for applying login credentials | |
CN108228613B (en) | Data reading method and device | |
CN114710392B (en) | Event information acquisition method and device | |
CN110020348B (en) | Early warning method and device for circled events | |
CN110968754B (en) | Detection method and device for crawler page turning strategy | |
CN109426559B (en) | Command issuing method and device, storage medium and processor | |
CN110852743A (en) | Data acquisition method and device | |
CN110851822A (en) | Network download security processing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing. Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Beijing city Haidian District Shuangyushu Area No. 76 Zhichun Road cuigongfandian 8 layer A. Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
GR01 | Patent grant | ||