CN111130900A - Data acquisition method and device based on distributed interconnection of coordination services - Google Patents
- Publication number
- CN111130900A (application CN201911399135.9A)
- Authority
- CN
- China
- Prior art keywords
- acquisition
- configuration
- program
- data
- collection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L41/00—Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
- H04L41/08—Configuration management of networks or network elements
- H04L41/0803—Configuration setting
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L67/00—Network arrangements or protocols for supporting network services or applications
- H04L67/01—Protocols
- H04L67/10—Protocols in which an application is distributed across nodes in the network
Landscapes
- Engineering & Computer Science (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The application provides a data acquisition method and device based on distributed interconnection of coordination services, a storage medium, and a processor. The data acquisition method comprises: determining the type of the data to be acquired; configuring an acquisition rule according to that type; determining an acquisition program according to the acquisition rule; and acquiring data according to the acquisition rule and the determined acquisition program. The method supports distributed, interconnected acquisition and reduces the complexity of data acquisition. Personnel without a relevant technical background can configure acquisition programs, which widens the range of use. The state of the acquisition programs can be monitored in time, so that problems with an acquisition program are not discovered late. Acquisition programs, including directional ones, can be managed, monitored, and controlled, which reduces cost and improves acquisition efficiency.
Description
Technical Field
The present application relates to data mining, and in particular, to a method and an apparatus for distributed interconnected data acquisition based on a coordination service, a storage medium, and a processor.
Background
At present, the open source acquisition programs popular on the network, such as WebMagic and Nutch, have the following shortcomings. Their support for distribution is insufficient, their acquisition strategies are complex, and data acquisition is complicated and inefficient. Acquisition requires coding, so non-developers cannot use them. There is no unified management and monitoring platform, so the acquisition programs cannot be managed or monitored. Most distributed crawlers are currently built on message queues, their functions are not uniform, and directional acquisition programs cannot be managed. Finally, the acquisition programs cannot be controlled dynamically, so labor cost is high.
The above information disclosed in this background section is only for enhancement of understanding of the background of the technology described herein and, therefore, certain information may be included in the background that does not form the prior art that is already known in this country to a person of ordinary skill in the art.
Disclosure of Invention
The application mainly aims to provide a distributed interconnected data acquisition method and device based on coordination service, a storage medium and a processor, so as to solve the problem of low data acquisition efficiency of an open source acquisition program in the prior art.
In order to achieve the above object, according to an aspect of the present application, there is provided a data acquisition method based on distributed interconnection of coordination services, the data acquisition method including: determining the type of the collected data; configuring an acquisition rule according to the type of the acquired data; determining an acquisition program according to the acquisition rule; and acquiring data according to the acquisition rule and the determined acquisition program.
Further, before determining the type of the acquired data, the acquisition method further comprises: starting a coordination service, wherein the coordination service is used for managing acquisition configuration information of the acquisition program and acquisition state information of the acquisition program; starting task management, acquisition management and configuration management, wherein the configuration management comprises configuration rules; and starting data acquisition according to the configuration rule.
Further, the task management comprises a task list, the acquisition management further comprises server configuration, and the configuration management comprises agent configuration and request configuration.
Further, the management of the acquisition state information by the coordination service includes: when data acquisition is started, submitting to ZooKeeper the IP address of the host where the acquisition program used for the data acquisition is located and the port number on which the program listens; and registering the IP address and port number as a temporary node, wherein the temporary node disappears when the connection between the registering service and ZooKeeper is interrupted, so that the acquisition state information of the acquisition program can be monitored.
Further, starting the coordination service further includes: the acquisition program providing a query interface for the acquisition state information, so that the CPU and memory of the server where the program is located can be queried.
Further, after acquiring data according to the acquisition rule and the determined acquisition program, the acquisition method further comprises: obtaining from the coordination service the acquisition rule allocated to the acquisition program; initializing the acquisition configuration information according to the acquisition rule, and obtaining a URL to be captured from the message queue specified by the acquisition configuration information; sending a request to the URL according to the request configuration and the proxy configuration; after the request is sent successfully, obtaining and parsing the response content; in the case that the response content is text, that is, it contains the content to be extracted, extracting and storing the response content according to the configuration; and in the case that the response content is not text, extracting the URLs contained in the response content, putting the URLs that have not yet been captured into the message queue after deduplication, and waiting for the next capture.
According to another aspect of the present application, there is provided a coordinated service based distributed interconnection acquisition apparatus, including: the first determining unit is used for determining the type of the acquired data; the configuration unit is used for configuring the acquisition rule according to the type of the acquired data; the second determining unit is used for determining an acquisition program according to the acquisition rule; and the third determining unit is used for acquiring data according to the acquisition rule and the determined acquisition program.
Further, the acquisition device further comprises a first control unit, a second control unit, and a third control unit, wherein the first control unit is used for starting a coordination service before the type of the acquired data is determined, the coordination service being used for managing the acquisition configuration information and the acquisition state information of the acquisition program; the second control unit is used for starting task management, acquisition management, and configuration management, the configuration management comprising configuration rules; and the third control unit is used for starting data acquisition according to the configuration rules.
According to another aspect of the application, a processor for running a program is provided, wherein the program, when running, performs any of the methods.
According to the technical scheme, the type of the acquired data is determined first, an acquisition rule is then configured according to that type, an acquisition program is then determined according to the acquisition rule, and finally data is acquired according to the acquisition rule and the determined acquisition program. Distributed, interconnected acquisition is therefore supported and the complexity of data acquisition is reduced. Personnel without a relevant technical background can configure acquisition programs, which widens the range of use. The state of the acquisition programs can be monitored in time, so that problems with an acquisition program are not discovered late. Acquisition programs, including directional ones, can be managed, monitored, and controlled, which reduces cost and improves acquisition efficiency.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this application, illustrate embodiments of the application and, together with the description, serve to explain the application and are not intended to limit the application. In the drawings:
FIG. 1 is a flow chart illustrating a method for data collection based on distributed interconnection of coordination services according to an embodiment of the present application;
FIG. 2 shows a flow diagram of data crawling according to an embodiment of the present application;
FIG. 3 shows a schematic diagram of a coordinated services based distributed interconnect acquisition device according to an embodiment of the present application; and
FIG. 4 illustrates a logic diagram for operation of a coordinated services based distributed interconnect data collection system according to an embodiment of the application.
Detailed Description
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only partial embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used may be interchanged under appropriate circumstances such that embodiments of the application described herein may be used. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It will be understood that when an element such as a layer, film, region, or substrate is referred to as being "on" another element, it can be directly on the other element or intervening elements may also be present. Also, in the specification and claims, when an element is described as being "connected" to another element, the element may be "directly connected" to the other element or "connected" to the other element through a third element.
For convenience of description, some terms or expressions referred to in the embodiments of the present application are explained below:
ZooKeeper: a distributed, open source coordination service for distributed applications. It is an open source implementation of Google's Chubby and an important component of Hadoop and HBase. It provides consistency services for distributed applications, including configuration maintenance, naming, distributed synchronization, and group services.
URL: on the WWW, each information resource has a uniform and unique address on the network, called its URL (uniform resource locator), that is, its network address.
Avro: a data serialization system. It provides rich data structures, a compact and fast binary data format, a container file for storing persistent data, and remote procedure call (RPC). It integrates simply with dynamic languages: code generation is not required to read or write data files or to use the RPC protocol, and is only an optional optimization worth implementing for statically typed languages.
Crawler: a program that downloads web pages from the World Wide Web, for example on behalf of a search engine. A traditional crawler starts from the URLs of one or more initial web pages, obtains the URLs found on those pages, and, while capturing pages, continuously extracts new URLs from the current page and puts them into a queue until a stop condition of the system is met.
Message queue: a container that holds messages while they are in transit. Messages are sent into the queue, and a message queue manager acts as an intermediary that relays each message from its source to its destination.
As introduced in the background, open source acquisition programs in the prior art have complex acquisition strategies and low efficiency, and cannot be dynamically controlled or monitored. To solve the problem of low data acquisition efficiency of open source acquisition programs, a typical embodiment of the present application provides a data acquisition method and apparatus based on distributed interconnection of coordination services, a storage medium, and a processor.
According to the embodiment of the application, a distributed interconnected data acquisition method based on coordination service is provided. Fig. 1 is a flowchart of a data collection method based on distributed interconnection of coordination services according to an embodiment of the present application. As shown in fig. 1, the method comprises the steps of:
Step S101: determining the type of the data to be acquired;
Step S102: configuring an acquisition rule according to the type of the acquired data;
Step S103: determining an acquisition program according to the acquisition rule; and
Step S104: acquiring data according to the acquisition rule and the determined acquisition program.
It should be noted that the steps illustrated in the flowcharts of the figures may be performed in a computer system such as a set of computer-executable instructions and that, although a logical order is illustrated in the flowcharts, in some cases, the steps illustrated or described may be performed in an order different than presented herein.
In the method, the type of the acquired data is determined first, an acquisition rule is then configured according to that type, an acquisition program is then determined according to the acquisition rule, and finally data is acquired according to the acquisition rule and the determined acquisition program. Distributed, interconnected acquisition is therefore supported and the complexity of data acquisition is reduced. Personnel without a relevant technical background can configure acquisition programs, which widens the range of use. The state of the acquisition programs can be monitored in time, so that problems with an acquisition program are not discovered late. Acquisition programs, including directional ones, can be managed, monitored, and controlled, which reduces cost and improves acquisition efficiency.
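The four steps S101 to S104 can be sketched as a small pipeline. This is only an illustration under stated assumptions: the `AcquisitionRule` class, the function names, the program names, and the rule that non-article data goes to a directional collector are all hypothetical and do not come from the application.

```python
from dataclasses import dataclass, field


@dataclass
class AcquisitionRule:
    """Hypothetical rule object; the field names are illustrative only."""
    data_type: str
    fields: dict = field(default_factory=dict)  # field name -> extraction expression
    directional: bool = False


def determine_data_type(task: dict) -> str:
    # S101: determine the type of the data to be acquired.
    return task.get("data_type", "article")


def configure_rule(data_type: str) -> AcquisitionRule:
    # S102: configure an acquisition rule for that data type.
    if data_type == "article":
        return AcquisitionRule(data_type, {"title": "//h1/text()", "body": "//article//text()"})
    # Assumption: anything else is handled by a hand-coded (directional) collector.
    return AcquisitionRule(data_type, directional=True)


def determine_program(rule: AcquisitionRule) -> str:
    # S103: choose an acquisition program according to the rule.
    return "directional-collector" if rule.directional else "general-collector"


def acquire(rule: AcquisitionRule, program: str) -> dict:
    # S104: acquire data; this sketch only reports what would be run.
    return {"program": program, "data_type": rule.data_type, "fields": sorted(rule.fields)}


task = {"data_type": "article"}
rule = configure_rule(determine_data_type(task))
result = acquire(rule, determine_program(rule))
print(result)
```

Keeping each step as a separate function mirrors the separation between configuration (S101, S102) and execution (S103, S104) that the method relies on.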
In an embodiment of the application, before determining the type of the acquired data, the acquisition method further includes: starting a coordination service, which is used for managing the acquisition configuration information and the acquisition state information of the acquisition program; starting task management, acquisition management, and configuration management, where the configuration management includes configuration rules; and starting data acquisition according to the configuration rules. The control center is an independently developed program that mainly provides interface operations, monitors the state of the acquisition programs, controls the acquisition programs, and issues acquisition configurations. The acquisition program is also independently developed; its main function is to acquire data according to the issued configuration rules. The data storage type can be customized and is mainly article data. The configuration uses an XML document format, and acquisition rules are distinguished by tag names. Based on the obtained configuration information and state information, users can then conveniently configure rules and start data acquisition according to them, which improves the validity of the data.
In an embodiment of the present application, the task management includes a task list, the acquisition management further includes server configuration, and the configuration management includes proxy configuration and request configuration. The task list lists all acquisition tasks in a tree structure. Acquisition management comprises rule configuration and server configuration. Rule configuration comprises basic configuration, detailed configuration, and acquisition configuration: the basic configuration includes the task name, data type, acquisition period, number of acquisition threads, and the like; the detailed configuration includes the seed addresses, data types, field extraction expressions, and the like; and the acquisition configuration specifies the acquisition program associated with the task. Server configuration displays the address of the server where the acquisition program is located, the name of the acquisition program, the CPU and memory usage of the server, the running state of the program, and the operations for starting or stopping the program. Proxy configuration configures IP proxies, including the user name, password, and/or address. Request configuration configures the request header information, including the browser type and/or version.
It should be noted that the task list shows, in list form, the relevant information of all acquisition tasks, including the task name, task id, owning user, configuration time, associated acquisition program, number of threads, task state, rule configuration, and start and stop operations. A browser-based acquisition plug-in is provided, so that acquisition rules can be generated by clicking. Rule configuration associates the proxy configuration, request configuration, and server configuration. When a rule is configured, the corresponding XPath expression can be extracted automatically with the XPath Helper plug-in for Chrome; after the XPath expression is pasted into the corresponding acquisition field, an XML configuration file in a fixed format is generated automatically. The acquisition programs are controlled and managed through an RPC remote call protocol, so that their acquisition rules and working states can be configured dynamically. Server configuration shows, in list form, the acquisition program name, server address, CPU, memory, running state, and the start and stop operations. The control center connects to ZooKeeper to obtain the server and listening port of each acquisition program, obtains the running state of the acquisition program and the state of the server through RPC remote method calls, and starts or stops the acquisition service.
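To make the rule file concrete, the following sketch generates a small XML document from field names and XPath expressions using Python's standard `xml.etree.ElementTree`. The application does not disclose its fixed format, so the element names (`task`, `dataType`, `seed`, `fields`, `field`), the function name, and the sample values are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_rule_xml(task_name, data_type, seed_url, field_xpaths):
    """Build a hypothetical rule file; the tag names are illustrative assumptions."""
    root = ET.Element("task", {"name": task_name})
    ET.SubElement(root, "dataType").text = data_type
    ET.SubElement(root, "seed").text = seed_url
    fields = ET.SubElement(root, "fields")
    # One <field> element per extracted field; the text is the pasted XPath expression.
    for name, xpath in field_xpaths.items():
        ET.SubElement(fields, "field", {"name": name}).text = xpath
    return ET.tostring(root, encoding="unicode")


xml_text = build_rule_xml(
    "news-demo",
    "article",
    "https://example.com/news",
    {"title": "//h1/text()", "content": "//div[@id='content']//text()"},
)
print(xml_text)
```

Because each rule is identified by its tag and `name` attribute, a collector could read back only the fields it needs with `ET.fromstring(xml_text).find("fields/field[@name='title']")`.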
When a rule is configured, the contents of the XML configuration file are submitted to the corresponding ZooKeeper node, and the acquisition program periodically fetches and executes them. An acquisition strategy based on protocol request headers, protocol proxies, acquisition interval frequency, and operation simulation is provided, which improves the acquisition success rate. Proxy configuration configures the information of the proxy pool; after configuration is completed, different proxy IPs can be returned through an interface and used to make requests appear to originate from different addresses. Request configuration configures the request header information and is used to make requests appear to come from different browsers, so that the information to be acquired is obtained more effectively. The list form presents the displayed information more clearly and in better order, and the plug-in helps the user locate and select the required elements on a website accurately and query the corresponding expressions.
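The proxy and request configuration described above can be pictured as a pool that hands out a different proxy address and request header for each request. The sketch below is a minimal illustration only: the `RequestProfilePool` class, its round-robin policy, and the sample addresses and browser strings are assumptions, not details from the application.

```python
import itertools


class RequestProfilePool:
    """Hypothetical pool that rotates proxies and User-Agent headers round-robin."""

    def __init__(self, proxies, user_agents):
        self._proxies = itertools.cycle(proxies)
        self._agents = itertools.cycle(user_agents)

    def next_profile(self):
        # Each call returns the next proxy and the next browser identity.
        return {
            "proxy": next(self._proxies),
            "headers": {"User-Agent": next(self._agents)},
        }


pool = RequestProfilePool(
    ["10.0.0.1:8080", "10.0.0.2:8080"],
    ["DemoBrowser/1.0", "DemoBrowser/2.0", "OtherBrowser/5.1"],
)
profiles = [pool.next_profile() for _ in range(3)]
for profile in profiles:
    print(profile)
```

Round-robin is the simplest policy; a real pool might instead pick randomly or drop proxies that fail.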
In an embodiment of the application, the management of the acquisition state information by the coordination service includes: when data acquisition is started, submitting to ZooKeeper the IP address of the host where the acquisition program is located and the port number on which the program listens, and registering them as a temporary node. The temporary node disappears when the connection between the registering service and ZooKeeper is interrupted, which makes it possible to monitor the acquisition state of the acquisition program. This ZooKeeper-based approach is used to monitor the state of the acquisition programs and to check the running state of each program in real time. ZooKeeper manages registered services as nodes and encapsulates complex, error-prone key services behind a simple, easy-to-use interface with good performance and stable behavior, so that the acquisition state can be monitored in real time and fed back to the user.
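The temporary-node mechanism (ZooKeeper calls these ephemeral nodes) can be illustrated without a running ZooKeeper server. The class below is an in-memory stand-in, not the ZooKeeper API: `CoordinationService`, its method names, the `/collectors/` path layout, and the sample addresses are hypothetical.

```python
class CoordinationService:
    """In-memory stand-in for a coordination service with temporary (ephemeral) nodes."""

    def __init__(self):
        self._nodes = {}  # path -> (session id, "ip:port" data)

    def register_ephemeral(self, session_id, path, data):
        # A collector registers its host address and listening port on start-up.
        self._nodes[path] = (session_id, data)

    def close_session(self, session_id):
        # When a session ends (connection lost), all of its nodes disappear.
        self._nodes = {p: v for p, v in self._nodes.items() if v[0] != session_id}

    def live_programs(self, prefix):
        # The control center lists the nodes to see which collectors are alive.
        return {p: data for p, (_, data) in self._nodes.items() if p.startswith(prefix)}


service = CoordinationService()
service.register_ephemeral("s1", "/collectors/collector-1", "10.0.0.5:9001")
service.register_ephemeral("s2", "/collectors/collector-2", "10.0.0.6:9001")
before = service.live_programs("/collectors/")
service.close_session("s1")  # simulate collector-1 losing its connection
after = service.live_programs("/collectors/")
print(sorted(before), sorted(after))
```

The point of the design is that liveness needs no separate heartbeat protocol: the absence of a node is itself the signal that a collector is down.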
In an embodiment of the application, starting the coordination service further includes: the acquisition program providing a query interface for the acquisition state information, so that the CPU and memory of the server where the program is located can be queried. Remote method calls are implemented over RPC based on Avro. The acquisition program provides an acquisition state interface through which information such as the CPU and memory of its server can be queried. When Avro is used for RPC, the server and the client exchange schemas during the connection handshake. Because each side then has the other's full schema, the correspondence between same-named fields, missing fields, and extra fields can be resolved consistently, so that subsequent data acquisition according to the rules can proceed more efficiently and accurately.
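A status interface of this kind could be declared as an Avro protocol, which is a JSON document listing record types and messages. The sketch below builds such a declaration as a Python dictionary and serializes it with the standard `json` module; it does not call any Avro library. The protocol name, namespace, record fields, and message names are assumptions for illustration, since the application does not publish its interface.

```python
import json

# Hypothetical Avro protocol for the collector's status interface.
STATUS_PROTOCOL = {
    "protocol": "CollectorStatus",
    "namespace": "example.collector",
    "types": [
        {
            "type": "record",
            "name": "ServerStatus",
            "fields": [
                {"name": "cpuPercent", "type": "double"},
                {"name": "memoryUsedBytes", "type": "long"},
                {"name": "running", "type": "boolean"},
            ],
        }
    ],
    "messages": {
        # Query CPU, memory, and running state of the collector's server.
        "getStatus": {"request": [], "response": "ServerStatus"},
        # Start or stop acquisition for a task.
        "start": {"request": [{"name": "taskId", "type": "string"}], "response": "boolean"},
        "stop": {"request": [{"name": "taskId", "type": "string"}], "response": "boolean"},
    },
}

protocol_json = json.dumps(STATUS_PROTOCOL, indent=2)
print(protocol_json)
```

Since both sides exchange such declarations during the handshake, a control center built against an older version of `ServerStatus` can still talk to a newer collector, provided added fields declare defaults.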
In an embodiment of the application, after acquiring data according to the acquisition rule and the determined acquisition program, the acquisition method further includes: obtaining from the coordination service the acquisition rule allocated to the acquisition program; initializing the acquisition configuration information according to the acquisition rule; obtaining a URL to be captured from the message queue specified by the acquisition configuration information; sending a request to the URL according to the request configuration and the proxy configuration; after the request is sent successfully, obtaining and parsing the response content; in the case that the response content is text, that is, it contains the content to be extracted, extracting and storing the response content according to the configuration; and in the case that the response content is not text, extracting the URLs contained in the response content, putting the URLs that have not yet been captured into the message queue after deduplication, and waiting for the next capture. As shown in fig. 2, the acquisition program obtains from the ZooKeeper coordination service the acquisition rule allocated to it, initializes the acquisition configuration information with that rule, and obtains a URL to be captured from the message queue specified in the configuration information. It then starts a capture thread and sends a request to the URL according to the request header and proxy configuration specified by the configuration. After the request succeeds, it obtains the response content and parses it, and judges whether the content is text. If so, it extracts the content according to the configuration and stores the data. If not, it extracts the URLs contained in the content and, after deduplication, puts the URLs that have not been captured into the message queue to wait for the next capture.
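The capture loop of fig. 2 can be sketched as follows. To stay self-contained, the sketch replaces the network with a small in-memory table and the message queue with a `deque`; `FAKE_WEB`, `fetch`, `crawl`, and all URLs are hypothetical names introduced for illustration.

```python
from collections import deque

# Hypothetical stand-in for the network: URL -> (is_text, payload).
# A text response carries content to store; a non-text response carries links.
FAKE_WEB = {
    "https://example.com/": (False, ["https://example.com/a", "https://example.com/b"]),
    "https://example.com/a": (True, "article A"),
    "https://example.com/b": (False, ["https://example.com/a", "https://example.com/c"]),
    "https://example.com/c": (True, "article C"),
}


def fetch(url, headers, proxy):
    # A real collector would send an HTTP request with these headers via the proxy.
    return FAKE_WEB.get(url)


def crawl(seed_urls, headers, proxy=None):
    queue = deque(seed_urls)   # message queue of URLs waiting to be captured
    seen = set(seed_urls)      # used to deduplicate URLs before enqueueing
    stored = {}
    while queue:
        url = queue.popleft()
        response = fetch(url, headers, proxy)
        if response is None:   # request failed; skip this URL
            continue
        is_text, payload = response
        if is_text:
            stored[url] = payload          # extract and store the content
        else:
            for link in payload:           # extract contained URLs
                if link not in seen:       # keep only URLs not yet captured
                    seen.add(link)
                    queue.append(link)
    return stored


stored = crawl(["https://example.com/"], {"User-Agent": "demo-agent"})
print(stored)
```

In a distributed deployment the `deque` and the `seen` set would live in shared infrastructure so that several collectors can drain the same queue.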
It should be noted that, by dividing the acquisition programs and message queues into groups, directional and non-directional acquisition programs can be managed at the same time, and acquisition efficiency can be improved through distribution. When the control center associates an acquisition configuration rule with the selected acquisition services, a unique message queue is generated for those acquisition programs, and all of them share that message queue to achieve distributed acquisition. The acquisition rules of a directional acquisition service are special: they can only be implemented through coding and cannot be configured dynamically, and the service cannot share a message queue with other crawlers. After a directional acquisition crawler has registered successfully with the distributed coordination service, the control center does not issue a configuration to it when issuing configurations. Multiple instances of the same directional acquisition service can be deployed and grouped under the same task, which achieves the same distributed acquisition effect as ordinary acquisition services. Apart from not receiving a configuration file, directional services behave the same as non-directional acquisition services, so unified management and unified scheduling can be achieved.
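The grouping idea, one message queue per task shared by every collector assigned to that task, can be sketched as below. The `ControlCenter` class, its `associate` method, and the task and collector names are hypothetical and only illustrate the sharing relationship.

```python
from collections import defaultdict, deque


class ControlCenter:
    """Hypothetical control center that keeps one shared queue per task."""

    def __init__(self):
        self.queues = {}                  # task id -> shared message queue
        self.members = defaultdict(list)  # task id -> collectors in the group

    def associate(self, task_id, collector):
        # The first association creates the task's unique queue; later
        # collectors in the same group receive the same queue object.
        queue = self.queues.setdefault(task_id, deque())
        self.members[task_id].append(collector)
        return queue


center = ControlCenter()
q1 = center.associate("task-news", "collector-1")
q2 = center.associate("task-news", "collector-2")
q3 = center.associate("task-directional", "directional-1")

q1.append("https://example.com/a")  # visible to collector-2, not to the directional group
print(len(q2), len(q3))
```

Directional collectors get their own group and queue here, which matches the statement that they do not share a message queue with other crawlers.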
The embodiment of the present application further provides a collection device for distributed interconnection based on coordination service, and it should be noted that the collection device for distributed interconnection based on coordination service according to the embodiment of the present application may be used to execute the collection method for distributed interconnection based on coordination service according to the embodiment of the present application. The following describes a distributed interconnected acquisition device based on coordination service provided by the embodiment of the present application.
Fig. 3 is a schematic diagram of a coordinated service based distributed interconnected acquisition device according to an embodiment of the present application. As shown in fig. 3, the apparatus includes:
a first determination unit 10 for determining the type of the acquired data;
a configuration unit 20, configured to configure an acquisition rule according to the type of the acquired data;
a second determining unit 30, configured to determine an acquisition procedure according to the acquisition rule;
and a third determining unit 40, configured to acquire data according to the acquisition rule and the determined acquisition program.
In the device, a first determining unit determines the type of the collected data, a configuration unit configures a collection rule according to the type of the collected data, a second determining unit determines a collection program according to the collection rule, and a third determining unit collects the data according to the collection rule and the determined collection program. Therefore, distributed interconnected acquisition can be supported, the complexity of data acquisition is reduced, personnel without relevant foundations can configure relevant acquisition programs, the use range is enlarged, the state of the acquisition programs can be monitored in time, the condition that the acquisition programs are not found in a delayed mode is avoided, the acquisition programs are managed and monitored, the acquisition programs are managed and controlled in an oriented mode, the cost is reduced, and the acquisition efficiency is improved.
In an embodiment of the present application, the device further includes a first control unit, a second control unit, and a third control unit. The first control unit is configured to start a coordination service before the type of the acquired data is determined; the coordination service manages the acquisition configuration information and the acquisition state information of the acquisition program. The second control unit is configured to start task management, acquisition management, and configuration management, where the configuration management includes configuration rules. The third control unit is configured to start data acquisition according to the configuration rules. The control center is an independently developed program that mainly provides an operation interface, monitors the state of the acquisition programs, controls the acquisition programs, and issues acquisition configurations. The acquisition program is likewise independently developed; its main function is to acquire data according to the issued configuration rules. The stored data type can be customized and is mainly article data. Acquisition rules use an XML document format and are distinguished by tag names. Based on the acquired configuration information and state information, users can conveniently configure rules later and then start data acquisition according to the configured rules, which improves the validity of the data.
In an embodiment of the present application, the task management in the second control unit includes a task list, the acquisition management includes rule configuration and server configuration, and the configuration management includes proxy configuration (also rendered as agent configuration) and request configuration. The task list lists all acquisition tasks in a tree structure. The rule configuration comprises basic configuration, detailed configuration, and acquisition configuration. The basic configuration includes the task name, data type, acquisition period, number of acquisition threads, and the like. The detailed configuration includes the acquisition seed addresses, data types, field extraction expressions, and the like. The acquisition configuration specifies the acquisition program associated with the task. The server configuration displays the address of the server where the acquisition program is located, the name of the acquisition program, the CPU and memory usage of the server, and the running state of the program, and provides operations to start or stop the program. The proxy configuration sets IP proxy information, including user names, passwords, and/or addresses. The request configuration sets request header information, including browser types and/or versions.
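As a hedged illustration of how the basic and detailed configuration could be serialized into an XML rule file whose entries are distinguished by tag names, the following sketch uses Python's standard `xml.etree.ElementTree`. The tag names (`task`, `basic`, `detail`, `seed`, `field`) and the function `build_rule_xml` are assumptions; the application does not disclose the actual file layout.

```python
import xml.etree.ElementTree as ET


def build_rule_xml(basic: dict, seeds: list, fields: dict) -> str:
    """Serialize a rule configuration to XML (hypothetical tag names)."""
    root = ET.Element("task")

    # Basic configuration: task name, data type, period, thread count, ...
    basic_el = ET.SubElement(root, "basic")
    for key, value in basic.items():
        ET.SubElement(basic_el, key).text = str(value)

    # Detailed configuration: seed addresses and field extraction expressions.
    detail_el = ET.SubElement(root, "detail")
    for url in seeds:
        ET.SubElement(detail_el, "seed").text = url
    for name, xpath in fields.items():
        ET.SubElement(detail_el, "field", name=name).text = xpath

    return ET.tostring(root, encoding="unicode")


rule_xml = build_rule_xml(
    {"name": "demo-task", "dataType": "article", "period": 3600, "threads": 4},
    ["https://example.com/"],
    {"title": "//h1/text()"},
)
```

Because every setting lives under its own tag, an acquisition program can read only the parts it needs by tag name.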
It should be noted that the task list shows information on all acquisition tasks in the form of a list, including the task name, task ID, owning user, configuration time, associated acquisition program, number of threads, task state, rule configuration, and start and stop controls. A browser-based acquisition plug-in is provided so that acquisition rules, rule configuration, associated proxy configuration, request configuration, server configuration, and/or configuration rules can be generated automatically by clicking. The corresponding XPath expression can be extracted automatically with the XPath Helper plug-in for Chrome; after the XPath expression is pasted into the corresponding acquisition field, an XML configuration file in a fixed format is generated automatically. A mode based on an RPC remote call protocol is provided to control and manage the acquisition programs, so that the acquisition rules and working state of an acquisition program can be configured dynamically. The server configuration shows, in the form of a list, the acquisition program name, the server address, the CPU, the memory, the running state, and controls to start and stop acquisition. The control center connects to the server and listening port of the acquisition program registered in ZooKeeper, obtains the running state of the acquisition program and the state of the server through RPC remote method calls, and starts or stops the acquisition service.
When a rule is configured, the content of the XML configuration file is submitted to the corresponding ZooKeeper node, and the acquisition program periodically fetches and executes it. An acquisition strategy based on protocol request headers, protocol proxies, acquisition interval frequency, and simulated operation is provided to improve the acquisition success rate. The proxy configuration sets the information of a proxy pool; after configuration, different proxy IPs can be returned through an interface so that requests appear to originate from different addresses. The request configuration sets request header information so that requests appear to be sent by different browsers, which helps obtain the information to be acquired more effectively. Presenting information in list form gives clearer and more orderly feedback, reduces memory usage, and improves in-depth management. The plug-in helps users locate and select the required elements on a website accurately and look up the corresponding code.
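The combination of proxy configuration and request configuration can be sketched as follows. This is a minimal illustration that only builds requests and never sends them; the class `RequestDisguiser`, the round-robin rotation policy, and the example proxy addresses are assumptions rather than details taken from the application.

```python
import itertools
import urllib.request


class RequestDisguiser:
    """Rotate proxies and User-Agent headers in round-robin order (assumed policy)."""

    def __init__(self, proxies, user_agents):
        self._proxies = itertools.cycle(proxies)
        self._agents = itertools.cycle(user_agents)

    def next_request(self, url):
        """Return (request, opener, proxy) for the next outgoing request."""
        proxy = next(self._proxies)
        # Request configuration: disguise the browser type via the header.
        request = urllib.request.Request(
            url, headers={"User-Agent": next(self._agents)}
        )
        # Proxy configuration: route the request through the chosen proxy.
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy, "https": proxy})
        )
        return request, opener, proxy


disguiser = RequestDisguiser(
    ["http://10.0.0.1:8080", "http://10.0.0.2:8080"],  # hypothetical proxy pool
    ["UA-1", "UA-2"],                                  # placeholder header values
)
```

A caller would pass the returned request to `opener.open(request)` to actually send it; that step is left out so the sketch has no network side effects.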
In an embodiment of the present application, the first control unit includes a submitting module and a monitoring module. The submitting module is configured to submit to ZooKeeper, when data acquisition is started, the IP address of the host where the acquisition program used for the data acquisition is located and the port number on which the program listens. The monitoring module is configured to register the IP address and port number as a temporary (ephemeral) node; the temporary node disappears when the connection between the registered service and ZooKeeper is interrupted, which makes it possible to monitor the acquisition state of the acquisition program. The ZooKeeper coordination service is thus used to monitor the state of the acquisition programs and to check their running state in real time. ZooKeeper manages registered services as nodes, encapsulates complex and error-prone key services, and provides users with a simple, easy-to-use interface and an efficient, stable system, so that the acquisition state can be monitored in real time and fed back to the user.
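The liveness semantics of a temporary node can be illustrated with a small in-memory stand-in. The sketch below is not a ZooKeeper client and makes no network calls; the class `CoordinationRegistry`, the `/collectors/` path prefix, and the session identifiers are hypothetical and serve only to show that a registration vanishes when its session ends.

```python
class CoordinationRegistry:
    """In-memory stand-in for ephemeral-node semantics (not a ZooKeeper client)."""

    def __init__(self):
        self._nodes = {}  # node path -> (session id, payload)

    def register_ephemeral(self, session_id, host, port):
        """Register host:port as a temporary node owned by the given session."""
        path = f"/collectors/{host}:{port}"  # hypothetical path layout
        self._nodes[path] = (session_id, {"host": host, "port": port})
        return path

    def close_session(self, session_id):
        """Simulate a lost connection: every node owned by the session disappears."""
        self._nodes = {
            path: entry
            for path, entry in self._nodes.items()
            if entry[0] != session_id
        }

    def live_collectors(self):
        """Paths of acquisition programs that are currently registered."""
        return sorted(self._nodes)


registry = CoordinationRegistry()
registry.register_ephemeral("session-1", "10.0.0.5", 9001)
registry.register_ephemeral("session-2", "10.0.0.6", 9001)
```

A control center that lists the registered paths therefore sees only the acquisition programs whose connection is still alive.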
In an embodiment of the present application, the first control unit includes a query module, configured so that the acquisition program provides a query interface for the acquisition state information, through which the CPU and memory of the server where the acquisition program is located can be queried. Remote RPC method calls can be implemented based on Avro: the acquisition program provides an acquisition state interface that answers queries for information such as the CPU and memory of the server where the program is located. When Avro is used for RPC, the server and the client exchange schemas during the connection handshake, so each side has the other's complete schema. Consistency problems in communication, such as identically named fields, missing fields, and extra fields, can therefore be resolved, and data acquisition can subsequently be performed more efficiently and accurately according to the rules.
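The way exchanged schemas help with missing and extra fields can be shown with a simplified resolver. The sketch below imitates the idea in plain Python and does not use the Avro library; the function `resolve_record` and the status field names `cpu_percent`, `memory_mb`, and `running` are assumptions.

```python
def resolve_record(writer_record: dict, reader_fields: list) -> dict:
    """Project a record written under one schema onto the reader's fields.

    Fields are matched by name, extra writer fields are ignored, and a field
    missing from the writer falls back to the reader's declared default.
    """
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            resolved[name] = writer_record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return resolved


# Status reported by an acquisition program (hypothetical field names).
status_from_collector = {"cpu_percent": 12.5, "memory_mb": 2048, "hostname": "node-1"}

# Fields the control center expects; 'running' is absent from the writer.
control_center_fields = [
    {"name": "cpu_percent"},
    {"name": "memory_mb"},
    {"name": "running", "default": True},
]

status = resolve_record(status_from_collector, control_center_fields)
```

Here the extra `hostname` field is dropped and the missing `running` field takes its default, so both sides can evolve their status format independently.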
In an embodiment of the present application, the device further includes a collecting unit, an initializing unit, a sending unit, an acquiring unit, a first extracting unit, and a second extracting unit. The collecting unit is configured to obtain, after data is acquired according to the acquisition rule and the determined acquisition program, the acquisition rule assigned to the acquisition program in the coordination service. The initializing unit is configured to initialize the acquisition configuration information according to the acquisition rule and to obtain a URL to be captured from the message queue specified by the acquisition configuration information. The sending unit is configured to send a request to the URL according to the request header and the proxy configuration specified by the configuration. The acquiring unit is configured to obtain the response content and parse it after the request is sent successfully. The first extracting unit is configured to extract and store content according to the configuration when the response content is text, the text containing the content to be extracted. The second extracting unit is configured to extract the URLs contained in the response content when the response content is not text, deduplicate them, put the URLs that have not yet been captured into the message queue, and wait for the next capture.
As shown in fig. 2, the acquisition program obtains the acquisition rule assigned to it from the ZooKeeper coordination service, initializes the acquisition configuration information with the rule, and obtains a URL to be captured from the message queue specified in the configuration information. It starts a capture thread and sends a request to the URL according to the request header and proxy configuration specified by the configuration. After the request succeeds, it obtains the response content, parses it, and judges whether the content is text. If so, it extracts content according to the configuration and stores the data. If not, it extracts the URLs contained in the content and, after deduplication, puts the uncaptured URLs into the message queue to wait for the next capture.
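The capture loop described above can be sketched in Python as follows. This is a simplified, single-threaded illustration under assumptions: the message queue is a local `deque`, the `fetch` and `extract` callables are supplied by the caller, links are found with a naive `href` pattern, and the `max_pages` bound is added only to keep the sketch finite.

```python
from collections import deque
import re

# Naive link pattern used only for this illustration.
URL_PATTERN = re.compile(r'href="([^"]+)"')


def crawl(seed_urls, fetch, extract, max_pages=100):
    """Process URLs from a queue: store text pages, expand non-text pages."""
    queue = deque(seed_urls)   # stands in for the configured message queue
    seen = set(seed_urls)      # URLs already queued, used for deduplication
    stored = []
    pages = 0
    while queue and pages < max_pages:
        url = queue.popleft()
        pages += 1
        kind, body = fetch(url)  # fetch returns (content kind, response body)
        if kind == "text":
            # Text page: extract fields according to the configuration and store.
            stored.append(extract(url, body))
        else:
            # Non-text page: queue every contained URL not seen before.
            for link in URL_PATTERN.findall(body):
                if link not in seen:
                    seen.add(link)
                    queue.append(link)
    return stored
```

In a distributed deployment the `deque` and the `seen` set would presumably be replaced by a shared message queue and a shared deduplication store, which this sketch does not attempt to model.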
It should be noted that a method is provided for dividing acquisition programs and message queues into groups, which makes it possible to manage directional and non-directional acquisition programs together and to improve efficiency through distributed acquisition. When the control center associates an acquisition configuration rule with the selected acquisition services, a unique message queue is generated for those acquisition programs. All acquisition programs in the group share this message queue, which achieves the effect of distributed acquisition. The acquisition rules of a directional acquisition service are special: they can only be implemented in code and cannot be configured dynamically. A directional acquisition service cannot share a message queue with other crawlers. After directional crawlers register successfully with the distributed coordination service, no configuration is issued to them when the control center issues configurations. Multiple instances of the same directional acquisition service can be deployed and grouped under the same task, which achieves the same distributed acquisition effect as ordinary acquisition services. Apart from the configuration file not being issued, directional acquisition services behave the same as non-directional ones, so unified management and unified scheduling can be achieved.
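Grouping acquisition programs around a shared message queue can be sketched as below. It is a minimal in-memory illustration; the class `GroupedQueues`, its method names, and the group labels are hypothetical, and a real system would presumably use an external message queue.

```python
from collections import defaultdict, deque


class GroupedQueues:
    """One URL queue per group; only members of a group may consume from it."""

    def __init__(self):
        self._queues = defaultdict(deque)  # group -> shared URL queue
        self._members = defaultdict(set)   # group -> acquisition program names

    def join(self, group, collector):
        """Associate an acquisition program with a group (a task)."""
        self._members[group].add(collector)

    def publish(self, group, url):
        """Put a URL into the queue shared by the group."""
        self._queues[group].append(url)

    def take(self, collector, group):
        """Hand the next URL to a group member, or None if the queue is empty."""
        if collector not in self._members[group]:
            raise PermissionError(f"{collector} is not in group {group}")
        queue = self._queues[group]
        return queue.popleft() if queue else None


queues = GroupedQueues()
queues.join("task-1", "collector-a")
queues.join("task-1", "collector-b")
queues.join("directional-x", "directional-1")  # directional service, own group
queues.publish("task-1", "https://example.com/1")
queues.publish("task-1", "https://example.com/2")
```

Members of `task-1` drain the same queue and so split the work, while the directional service in its own group cannot consume from it.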
The acquisition device based on the distributed interconnection of the coordination service comprises a processor and a memory, wherein the first determining unit, the configuration unit, the second determining unit, the third determining unit and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. There may be one or more kernels, and the data acquisition efficiency of the open source acquisition program is improved by adjusting kernel parameters.
The memory may include volatile memory in a computer readable medium, Random Access Memory (RAM) and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
The embodiment of the invention provides a storage medium, wherein a program is stored on the storage medium, and when the program is executed by a processor, the acquisition method based on the distributed interconnection of the coordination service is realized.
The embodiment of the invention provides a processor, which is used for running a program, wherein the acquisition method based on the distributed interconnection of the coordination service is executed when the program runs.
The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein when the processor executes the program, at least the following steps are realized:
step S101, determining the type of the collected data;
step S102, configuring an acquisition rule according to the type of the acquired data;
step S103, determining an acquisition program according to the acquisition rule;
and step S104, acquiring data according to the acquisition rule and the determined acquisition program.
The device herein may be a server, a PC, a PAD, a mobile phone, etc.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with at least the following method steps:
step S101, determining the type of the collected data;
step S102, configuring an acquisition rule according to the type of the acquired data;
step S103, determining an acquisition program according to the acquisition rule;
and step S104, acquiring data according to the acquisition rule and the determined acquisition program.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include forms of volatile memory in a computer readable medium, Random Access Memory (RAM) and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, a computer readable medium does not include a transitory computer readable medium such as a modulated data signal and a carrier wave.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
Examples
This example relates to the operating logic of a distributed interconnected data acquisition system based on a coordination service. As shown in fig. 4, the acquisition system comprises a control center, a distributed coordination service, and acquisition services. In the control center, the task list displays task information as a list and provides access to the configuration rules; the configuration information, comprising rule configuration, request configuration, and proxy configuration, is managed dynamically. After an acquisition service is started, it is registered with and managed through the distributed coordination service and obtains the corresponding configuration. The control center controls the acquisition program, choosing whether to start or stop it, and monitors the program in real time while data is captured and resource usage is monitored.
From the above description, it can be seen that the above-described embodiments of the present application achieve the following technical effects:
1) The distributed interconnected acquisition method based on the coordination service first determines the type of the acquired data, then configures an acquisition rule according to that type, then determines an acquisition program according to the acquisition rule, and finally acquires data according to the acquisition rule and the determined acquisition program. The method therefore supports distributed interconnected acquisition and reduces the complexity of data acquisition. Personnel without a relevant technical background can configure acquisition programs, which widens the range of use. The state of the acquisition programs can be monitored in time, so that problems with an acquisition program are not discovered late. Acquisition programs, including directional acquisition programs, are managed and monitored centrally, which reduces cost and improves acquisition efficiency.
2) In the distributed interconnected acquisition device based on the coordination service, the first determining unit determines the type of the acquired data, the configuration unit configures an acquisition rule according to that type, the second determining unit determines an acquisition program according to the acquisition rule, and the third determining unit acquires data according to the acquisition rule and the determined acquisition program. The device achieves the same technical effects as the method described in 1).
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.
Claims (10)
1. A data acquisition method based on distributed interconnection of coordination services is characterized by comprising the following steps:
determining the type of the collected data;
configuring an acquisition rule according to the type of the acquired data;
determining an acquisition program according to the acquisition rule;
and acquiring data according to the acquisition rule and the determined acquisition program.
2. The method of claim 1, wherein prior to determining the type of data acquired, the acquisition method further comprises:
starting a coordination service, wherein the coordination service is used for managing acquisition configuration information of the acquisition program and acquisition state information of the acquisition program;
starting task management, acquisition management and configuration management, wherein the configuration management comprises configuration rules;
and starting data acquisition according to the configuration rule.
3. The method of claim 2, wherein the task management comprises a task list, wherein the collection management further comprises server configuration, and wherein the configuration management comprises agent configuration and request configuration.
4. The method of claim 2, wherein the coordinating service manages the collection status information, comprising:
when the data acquisition is started, submitting to ZooKeeper the IP address information of the host where the acquisition program used for the data acquisition is located and the information on the port number on which the program listens;
and registering the IP address information and the port number information as a temporary node, wherein the temporary node disappears when the connection between the registered service and ZooKeeper is interrupted, so as to monitor the acquisition state information of the acquisition program.
5. The method of claim 4, wherein the starting a coordination service further comprises: the acquisition program providing a query interface for the acquisition state information, so as to query the CPU and the memory of the server where the program is located.
6. The method of claim 3, wherein after acquiring data according to the acquisition rules and the determined acquisition procedures, the acquisition method further comprises:
obtaining the acquisition rule assigned to the acquisition program in the coordination service;
initializing the acquisition configuration information according to the acquisition rule, and obtaining a URL to be captured from a message queue specified by the acquisition configuration information;
sending a request to the URL according to the request header specified by the configuration and the proxy configuration;
after the request is successfully sent, obtaining and parsing the response content;
under the condition that the response content is a text, extracting and storing the response content according to the configuration, wherein the text contains the response content to be extracted;
and under the condition that the response content is not the text, extracting the URLs contained in the response content, putting the uncaptured URLs into the message queue after deduplication, and waiting for the next capture.
7. An acquisition device based on distributed interconnection of coordination services, comprising:
the first determining unit is used for determining the type of the acquired data;
the configuration unit is used for configuring the acquisition rule according to the type of the acquired data;
the second determining unit is used for determining an acquisition program according to the acquisition rule;
and the third determining unit is used for acquiring data according to the acquisition rule and the determined acquisition program.
8. The device of claim 7, wherein the acquisition device further comprises:
a first control unit, a second control unit and a third control unit, wherein the first control unit is used for starting a coordination service before determining the type of acquired data, and the coordination service is used for managing acquisition configuration information of an acquisition program and acquisition state information of the acquisition program;
the second control unit is used for starting task management, acquisition management and configuration management, and the configuration management comprises configuration rules;
and the third control unit is used for starting data acquisition according to the configuration rule.
9. A storage medium, characterized in that the storage medium comprises a stored program, wherein the program performs the method of any one of claims 1 to 6.
10. A processor, characterized in that the processor is configured to run a program, wherein the program when running performs the method of any of claims 1 to 6.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201911399135.9A CN111130900A (en) | 2019-12-30 | 2019-12-30 | Data acquisition method and device based on distributed interconnection of coordination services |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN111130900A true CN111130900A (en) | 2020-05-08 |
Family
ID=70505580
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201911399135.9A Pending CN111130900A (en) | 2019-12-30 | 2019-12-30 | Data acquisition method and device based on distributed interconnection of coordination services |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN111130900A (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US20110302239A1 (en) * | 2010-06-04 | 2011-12-08 | International Business Machines Corporation | Managing Rule Sets as Web Services |
| CN102739775A (en) * | 2012-05-29 | 2012-10-17 | 宁波东冠科技有限公司 | Method for monitoring and managing Internet of Things data acquisition server cluster |
| CN104714875A (en) * | 2015-03-11 | 2015-06-17 | 浪潮集团有限公司 | Distributed automatic collecting method |
| CN106776693A (en) * | 2016-11-10 | 2017-05-31 | 福建中金在线信息科技有限公司 | A kind of website data acquisition method and device |
Cited By (5)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112653588A (en) * | 2020-07-10 | 2021-04-13 | 深圳市唯特视科技有限公司 | Adaptive network traffic collection method, system, electronic device and storage medium |
| CN112631561A (en) * | 2020-12-29 | 2021-04-09 | 智慧神州(北京)科技有限公司 | Data source docking method and device, processor and data source docking system |
| CN114065092A (en) * | 2021-11-10 | 2022-02-18 | 奇安信科技集团股份有限公司 | Website identification method, device, computer equipment and storage medium |
| CN114065092B (en) * | 2021-11-10 | 2025-03-21 | 奇安信科技集团股份有限公司 | Website identification method, device, computer equipment and storage medium |
| CN116578605A (en) * | 2023-04-19 | 2023-08-11 | 广东畅视科技有限公司 | Data acquisition method and device, electronic equipment and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | TA01 | Transfer of patent application right | Effective date of registration: 20200806; Address after: 1608, 14/F, No. 65, Beisihuan West Road, Haidian District, Beijing 100080; Applicant after: BEIJING INTERNETWARE Ltd.; Address before: No. 603, Floor 6, No. 9, Shangdi 9th Street, Haidian District, Beijing 100085; Applicant before: Smart Shenzhou (Beijing) Technology Co., Ltd. |
| | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200508 |