CN119441325A - Data collection method, device, computer equipment and storage medium - Google Patents

Data collection method, device, computer equipment and storage medium

Info

Publication number
CN119441325A
Authority
CN
China
Prior art keywords
data
interface
interface type
reliability enhancement
data acquisition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310946867.5A
Other languages
Chinese (zh)
Other versions
CN119441325B (en
Inventor
王�华
王岩
曾敬勇
徐慧如
王云飞
丁尚君
张为民
刘建
杨传江
王剑
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunlun Digital Technology Co ltd
China National Petroleum Corp
Original Assignee
Kunlun Digital Technology Co ltd
China National Petroleum Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunlun Digital Technology Co ltd, China National Petroleum Corp
Priority to CN202310946867.5A
Publication of CN119441325A
Application granted
Publication of CN119441325B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25 Integrating or interfacing systems involving database management systems
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

本申请提供了一种数据采集方法、装置、计算机设备及存储介质,属于数据管理技术领域。在本申请中,通过为不同接口类型的数据接口自动配置可靠性增强策略,可以保证执行数据采集任务时不遗漏数据,避免数据缺失,增强了数据的完整性,且对于拥有多个副本的数据源,上述方法可以保证数据源多个副本之间的数据的一致性,进一步地,在基于可靠性增强策略进行数据采集的基础上,对采集到的数据进行数据完整性校验,更加保证了数据的完整性,然后对采集到的数据去重,提高了数据质量。综合使用上述方法,保证了数据的可靠性。

The present application provides a data collection method, a device, computer equipment, and a storage medium, and belongs to the field of data management technology. In the present application, reliability enhancement strategies are automatically configured for data interfaces of different interface types, which ensures that no data is missed when a data collection task is executed, avoids missing data, and enhances data integrity. For a data source with multiple copies, the method also ensures data consistency among the copies. Furthermore, in addition to collecting data based on the reliability enhancement strategy, a data integrity check is performed on the collected data, which further ensures data integrity, and the collected data is then deduplicated, which improves data quality. Used in combination, these measures ensure the reliability of the data.

Description

Data acquisition method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of data management technologies, and in particular, to a data acquisition method, a data acquisition device, a computer device, and a storage medium.
Background
Databases are an important technical platform for data management, and data acquisition, as a core link in data management, is of great significance to it.
Current data acquisition methods all acquire data based on a fixed rule. For example, if the data to be collected is the data of a designated time, the server passes the designated time to the data source system as the incoming parameter and collects, through the data interface, the data in the data source system whose update time equals the designated time.
However, if the system is abnormal or in a critical state, data may be missed when the data acquisition task is executed, which affects the integrity of the data. In addition, if the data source has multiple copies, the data stored in the copies may be inconsistent because of network delay, untimely data synchronization, and similar problems, so the acquired data is inconsistent as well. In summary, the reliability of the data is poor.
Disclosure of Invention
The embodiments of the present application provide a data acquisition method, a data acquisition device, a computer device, and a storage medium, which ensure the integrity and consistency of data and enhance its reliability. The technical solution is as follows:
In one aspect, a data acquisition method is provided, the method comprising:
In response to receiving an incoming parameter of a data acquisition task, optimizing the incoming parameter based on a target reliability enhancement strategy of a data interface, wherein the target reliability enhancement strategy is an incoming parameter processing mode corresponding to an interface type to which the data interface belongs;
calling the data interface according to the optimized input parameters to execute the data acquisition task;
and processing the acquired data and storing the processed data into a database.
In another aspect, there is provided a data acquisition device, the device comprising:
an incoming parameter optimization module, configured to, in response to receiving an incoming parameter of a data acquisition task, optimize the incoming parameter based on a target reliability enhancement strategy of a data interface, wherein the target reliability enhancement strategy is an incoming parameter processing mode corresponding to the interface type to which the data interface belongs;
a data interface calling module, configured to call the data interface according to the optimized incoming parameter to execute the data acquisition task;
a data processing module, configured to process the acquired data and store the processed data into a database.
In one possible implementation, the apparatus further includes:
The policy configuration module is configured to configure a target reliability enhancement policy for a data interface of a data management platform based on an interface type to which the data interface belongs, where the target reliability enhancement policy is an incoming parameter processing mode corresponding to the interface type to which the data interface belongs.
In one possible implementation, the policy configuration module is configured to: for a first interface type, where the first interface type refers to a data interface that uses natural time as the increment field, configure a first reliability enhancement policy for the data interface if the data acquisition protocol supported by the data interface transmits data whose update time is greater than the incoming parameter, where the first reliability enhancement policy is to subtract one time unit from the incoming parameter of the data acquisition task;
and for a second interface type, where the second interface type refers to a data interface that uses a start-stop time period as the increment field and the data acquisition protocol supported by the data interface transmits data whose update time falls within the start-stop time period, configure a second reliability enhancement policy for the data interface, where the second reliability enhancement policy is to move the incoming parameters forward and backward, respectively, by a specified time step, taking the incoming parameters as the baseline.
In one possible implementation, the incoming parameter optimization module is configured to: for the first interface type, subtract one time unit from the incoming parameter of the data acquisition task if the data acquisition protocol supported by the data interface transmits data whose update time is greater than the incoming parameter;
and for the second interface type, where the data acquisition protocol supported by the data interface transmits data whose update time falls within the start-stop time period, move the incoming parameters forward and backward, respectively, by a specified time step, taking the incoming parameters as the baseline.
In one possible implementation, the data processing module is configured to: perform a corresponding data integrity check on the collected data based on the interface type of the data interface; if the integrity check passes, perform the data deduplication step; if the integrity check fails, re-execute the data collection task;
and compare the collected data with the data in the cache area to remove duplicate data.
In one possible implementation, the apparatus further includes:
the scheme configuration module is used for acquiring a target data integrity check scheme from a mapping relation between the interface type and a data integrity check scheme based on the interface type of the data interface, and configuring the target data integrity check scheme for the data interface, wherein the target data integrity check scheme is the data integrity check scheme corresponding to the interface type.
In one possible implementation, the data processing module is configured to: for a third interface type, compare the version numbers of the collected data with the version numbers of the data in the cache area one by one; if the version number of a collected record is the same as that of a record in the cache area, remove the record from the collected data as a duplicate; if the version number is different, retain the record;
and for other interface types, compare the primary keys of the collected data with the primary keys of the data in the cache area one by one; if the primary key of a collected record is the same as that of a record in the cache area, remove the record from the collected data as a duplicate; if the primary key is different, retain the record.
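The two comparison rules above can be sketched as follows. This is only an illustration under stated assumptions, not the application's implementation: records are assumed to be dictionaries, the field names `version` and `id` and the helper `deduplicate` are hypothetical, and a set lookup stands in for the one-by-one comparison described in the text.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def deduplicate(collected: List[Record], cached: List[Record],
                is_third_type: bool) -> List[Record]:
    """Drop collected records that already exist in the cache area.

    Third interface type: records are matched on the version number.
    Other interface types: records are matched on the primary key.
    The field names "version" and "id" are assumptions for this sketch.
    """
    key = "version" if is_third_type else "id"
    # Build the set of keys already cached so each lookup is O(1).
    cached_keys = {row[key] for row in cached}
    # Keep only the collected records whose key is not in the cache.
    return [row for row in collected if row[key] not in cached_keys]


cached = [{"id": 1, "version": 10}]
collected = [{"id": 1, "version": 11}, {"id": 2, "version": 10}]
by_version = deduplicate(collected, cached, is_third_type=True)
by_primary_key = deduplicate(collected, cached, is_third_type=False)
```

With the sample data, matching on the version number keeps the record with version 11, while matching on the primary key keeps the record with id 2.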
In another aspect, a computer device is provided that includes a processor and a memory for storing at least one segment of a computer program that is loaded and executed by the processor to implement operations performed by a data collection method in an alternative implementation of the application.
In another aspect, a computer readable storage medium having stored therein at least one segment of a computer program that is loaded and executed by a processor to perform operations as performed by a data acquisition method in an alternative implementation of the application is provided.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer program code stored in a computer readable storage medium, the computer program code being read from the computer readable storage medium by a processor of a computer device, the computer program code being executed by the processor such that the computer device performs the operations performed by the data acquisition method provided in the various alternative implementations described above.
In the embodiments of the present application, reliability enhancement strategies are automatically configured for data interfaces of different interface types, which ensures that no data is missed when a data acquisition task is executed, avoids missing data, and enhances data integrity. For a data source with multiple copies, the method also ensures data consistency among the copies. Furthermore, in addition to acquiring data based on the reliability enhancement strategy, a data integrity check is performed on the acquired data, which further ensures data integrity, and the acquired data is then deduplicated, which improves data quality. Used in combination, these measures ensure the reliability of the data.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an implementation environment of a data acquisition method according to an embodiment of the present application;
FIG. 2 is a flow chart of a data acquisition method according to an embodiment of the present application;
FIG. 3 is a flow chart of a method for reliability enhancement policy configuration for a data interface of different interface types according to an embodiment of the present application;
FIG. 4 is a block diagram of a data acquisition device provided in accordance with an embodiment of the present application;
FIG. 5 is a block diagram of a system architecture of a data acquisition method according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
In the present application, the term "at least one" means one or more, and "a plurality of" means two or more.
The term "and/or" in the present application describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may indicate that A exists alone, that A and B exist together, or that B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
It should be noted that, the data (including, but not limited to, data for analysis, stored data, displayed data, etc.) related to the present application are all authorized by the user or are fully authorized by the parties, and the collection, use and processing of the related data are required to comply with the related laws and regulations and standards of the related countries and regions.
Fig. 1 is a schematic diagram of an implementation environment of a data acquisition method according to an embodiment of the present application. Referring to FIG. 1, the implementation environment includes a data management platform 101 and a plurality of data source systems 102.
The data management platform 101 is configured to collect data from each data source system 102, and to process and store the collected data in order to provide data services. In some embodiments, the data management platform 101 is an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms.
Wherein, a plurality of data source systems 102 are used for providing data for the data management platform 101, and the plurality of data source systems 102 perform data transmission with the data management platform 101 through a data interface.
In some embodiments, the servers in the data management platform 101 may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
In some embodiments, the wireless network or wired network described above uses standard communication techniques and/or protocols. The network is typically the Internet, but can be any network, including but not limited to any combination of a LAN (Local Area Network), a MAN (Metropolitan Area Network), a WAN (Wide Area Network), a mobile, wired, or wireless network, a private network, or a virtual private network. In some embodiments, the data exchanged over the network is represented using techniques and/or formats including HTML (HyperText Markup Language), XML (Extensible Markup Language), and the like. In addition, all or some of the links can be encrypted using conventional encryption techniques such as SSL (Secure Sockets Layer), TLS (Transport Layer Security), VPN (Virtual Private Network), and IPsec (Internet Protocol Security). In other embodiments, custom and/or dedicated data communication techniques can also be used in place of or in addition to the data communication techniques described above.
FIG. 2 is a flowchart of a data acquisition method according to an embodiment of the present application. As shown in FIG. 2, taking a server of the data management platform as the executing entity, the method includes the following steps:
201. Based on the interface type of the data interface of the data management platform, configuring a target reliability enhancement strategy for the data interface, wherein the target reliability enhancement strategy is the reliability enhancement strategy corresponding to the interface type.
In the embodiment of the application, the reliability enhancement policy refers to an incoming parameter processing mode which is newly added for enhancing the reliability of the data interface. The input parameters are data acquisition conditions corresponding to the data acquisition tasks.
The interface type can be determined in two ways: automatic identification and manual selection. In automatic identification, the interface type of the data interface is identified according to the protocol fields supported by the data interface. In manual selection, a plurality of selectable interface types are provided for the data interface, and the interface type selected by a management user is determined as the interface type to which the data interface belongs.
In some embodiments, the step 201 includes obtaining a target reliability enhancement policy from a mapping relationship between an interface type and a reliability enhancement policy based on the interface type, where the target reliability enhancement policy is a reliability enhancement policy corresponding to the interface type, and configuring the target reliability enhancement policy for the data interface.
Step 201 is a process of performing policy optimization on the data interface and may be performed at any stage. The policy optimization need not be repeated for every data acquisition; each data acquisition may instead proceed based on the policy that is already configured.
Referring to fig. 3, different reliability enhancement policies are configured for different interface types, and the reliability enhancement policies are explained below based on the different interface types.
The increment field referred to in the first four cases below is the field of the data in the data source system to which newly added content belongs.
In the first case, the interface type is a first interface type, and the first interface type refers to a data interface taking natural time as an increment field.
Here, natural time refers to a precise time expressed in a relatively coarse-grained time unit, such as a year, a month, a day, a season, a week, an hour, or a minute.
For the first interface type, whether to configure a reliability enhancement policy for the data interface is determined based on the data acquisition protocol supported by the data interface. If the protocol transmits data whose update time is greater than or equal to the incoming parameter, no reliability enhancement policy is configured, and data is transmitted using the original data interface. If the protocol transmits data whose update time is strictly greater than the incoming parameter, a first reliability enhancement policy is configured for the data interface: one time unit is subtracted from the incoming parameter of the data acquisition task to obtain the optimized incoming parameter.
For a data interface whose data acquisition protocol transmits data with an update time greater than or equal to the incoming parameter, the transmitted data already meets the consistency and integrity requirements, so no reliability enhancement policy is needed. For a data interface whose protocol transmits only data with an update time strictly greater than the incoming parameter, data may be missed in a critical state: when the start time in the incoming parameter issued by the data acquisition task is the same as a data update time in the data source system, the data updated at that moment is excluded by the protocol. A reliability enhancement policy is therefore configured for the data interface to avoid missing data updated at the critical moment.
In the second case, the interface type is a second interface type, and the second interface type refers to a data interface taking a start-stop time period as an increment field.
The start-stop time period may be determined by a start time point and an end time point, or may be determined by a start time point and a data acquisition duration, but is not limited thereto.
For the second interface type, the data acquisition protocol supported by the data interface transmits data whose update time falls within the start-stop time period, and a second reliability enhancement policy is configured for the data interface. Taking the incoming parameters as the baseline, the second reliability enhancement policy moves the start point earlier and the end point later by a specified time step to obtain new incoming parameters, which define a new start-stop time period.
For a data interface whose protocol transmits data with an update time within the start-stop time period, only part of the data updated at the start time point and the end time point of the data acquisition task may be acquired. A reliability enhancement policy is therefore configured for the data interface so that all data updated by the data source system within the start-stop time period is obtained.
In the third case, the interface type is a third interface type, and the third interface type refers to a data interface with a version number as an increment field.
For the third interface type, the data acquisition protocol supported by the data interface transmits the data whose version number corresponds to the incoming parameter. Because every piece of data in the data source system has a unique version number, the integrity and consistency of the data are already ensured. No reliability enhancement policy needs to be configured for the data interface, and data is transmitted using the original data interface.
In the fourth case, the interface type is a fourth interface type, and the fourth interface type refers to a data interface taking a designated time as an increment field.
For the fourth interface type, the data acquisition protocol supported by the data interface transmits the data whose update time is the same as the incoming parameter. Because the update time of data in the data source system is unique, the data interface transmits exactly the data of the designated time on each call, so the integrity and consistency of the data are already ensured. No reliability enhancement policy needs to be configured for the data interface, and data is transmitted using the original data interface.
In the fifth case, the interface type is another interface type.
For data interfaces of other interface types, custom reliability enhancement policies may be configured for them.
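The five cases above amount to a lookup from interface type to reliability enhancement policy. The sketch below shows one way such a mapping could look; the enum names, the `inclusive_lower_bound` flag (true when the protocol transmits data with an update time greater than or equal to the incoming parameter), and the `select_policy` helper are all hypothetical names introduced for illustration.

```python
from enum import Enum, auto
from typing import Optional


class InterfaceType(Enum):
    NATURAL_TIME = auto()       # first type: natural time as increment field
    START_STOP_PERIOD = auto()  # second type: start-stop period as increment field
    VERSION_NUMBER = auto()     # third type: version number as increment field
    SPECIFIED_TIME = auto()     # fourth type: designated time as increment field
    OTHER = auto()              # fifth case: any other interface type


class Policy(Enum):
    SUBTRACT_ONE_UNIT = auto()  # first reliability enhancement policy
    WIDEN_PERIOD = auto()       # second reliability enhancement policy
    CUSTOM = auto()             # placeholder for a custom policy


def select_policy(interface_type: InterfaceType,
                  inclusive_lower_bound: bool = False) -> Optional[Policy]:
    """Return the policy for an interface type, or None if none is needed."""
    if interface_type is InterfaceType.NATURAL_TIME:
        # A ">=" protocol already covers the boundary; a ">" protocol does not.
        return None if inclusive_lower_bound else Policy.SUBTRACT_ONE_UNIT
    if interface_type is InterfaceType.START_STOP_PERIOD:
        return Policy.WIDEN_PERIOD
    if interface_type in (InterfaceType.VERSION_NUMBER,
                          InterfaceType.SPECIFIED_TIME):
        # Unique version numbers or update times: no policy is configured.
        return None
    return Policy.CUSTOM
```

Returning `None` models the cases in which the original data interface is used unchanged.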
202. In response to receiving an incoming parameter of a data acquisition task, the incoming parameter is optimized based on a target reliability enhancement policy of the data interface.
In some embodiments, for the first interface type, if the data acquisition protocol supported by the data interface transmits data whose update time is greater than or equal to the incoming parameter, no reliability enhancement policy is configured for the data interface and, accordingly, the incoming parameter is not optimized. If the protocol transmits data whose update time is strictly greater than the incoming parameter, the first reliability enhancement policy configured for the data interface subtracts one time unit from the incoming parameter of the data acquisition task, and the incoming parameter is optimized based on this policy.
For example, suppose the data acquisition task is to acquire data updated by the data source system after 8:30, the incoming parameter is 8:30, the time unit is one minute, and the data acquisition protocol supported by the data interface transmits data whose update time is greater than the incoming parameter. Based on the first reliability enhancement policy, one time unit is subtracted from the incoming parameter of the data acquisition task, so the incoming parameter is adjusted from 8:30 to 8:29.
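The 8:30 example can be reproduced with a few lines of Python. This is a minimal sketch: the function name `apply_first_policy`, the `inclusive` flag, and the calendar date are assumptions made for illustration, since the text specifies only the time of day and the one-minute unit.

```python
from datetime import datetime, timedelta


def apply_first_policy(incoming: datetime, time_unit: timedelta,
                       inclusive: bool) -> datetime:
    """Optimize the incoming parameter for the first interface type.

    inclusive=True models a protocol that returns data with an update time
    >= the incoming parameter, in which case nothing is changed.
    inclusive=False models a protocol that returns only data with an update
    time > the incoming parameter, so one time unit is subtracted.
    """
    if inclusive:
        return incoming
    return incoming - time_unit


# Hypothetical date; only the 8:30 time of day comes from the example.
incoming = datetime(2024, 1, 1, 8, 30)
optimized = apply_first_policy(incoming, timedelta(minutes=1), inclusive=False)
```

After the adjustment, a record updated at exactly 8:30 satisfies "update time greater than 8:29" and is no longer skipped, at the cost of possibly re-collecting records updated between 8:29 and 8:30, which the later deduplication step is meant to remove.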
In some embodiments, for the second interface type, the data acquisition protocol supported by the data interface transmits data whose update time falls within the start-stop time period. The second reliability enhancement policy configured for the data interface takes the incoming parameters as the baseline and moves them forward and backward, respectively, by a specified time step, and the incoming parameters are optimized based on this policy.
For example, suppose the data acquisition task is to acquire data updated by the data source system from 8:30 to 8:40, the incoming parameters are 8:30 and 8:40, the specified time step is 1 minute, and the data acquisition protocol supported by the data interface transmits data whose update time falls within the start-stop time period. Based on the second reliability enhancement policy, the start point is moved one step earlier and the end point one step later, so the incoming parameters are adjusted from 8:30 and 8:40 to 8:29 and 8:41.
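The widening of the start-stop period can be sketched in the same way. The function name `apply_second_policy` and the calendar date are hypothetical; the times and the one-minute step come from the example.

```python
from datetime import datetime, timedelta
from typing import Tuple


def apply_second_policy(start: datetime, end: datetime,
                        step: timedelta) -> Tuple[datetime, datetime]:
    """Widen a start-stop period by one step on each side."""
    if end < start:
        raise ValueError("end must not be earlier than start")
    # Move the start point earlier and the end point later by the same step.
    return start - step, end + step


# Hypothetical date; the 8:30-8:40 window and 1-minute step are from the text.
start = datetime(2024, 1, 1, 8, 30)
end = datetime(2024, 1, 1, 8, 40)
new_start, new_end = apply_second_policy(start, end, timedelta(minutes=1))
```

The widened window is 8:29 to 8:41, so records updated exactly at either boundary of the original window fall strictly inside the new one.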
In some embodiments, for the third interface type, the data acquisition protocol supported by the data interface transmits the data whose version number corresponds to the incoming parameter. Because every piece of data in the data source system has a unique version number, the integrity and consistency of the data are already ensured, so no reliability enhancement policy is configured for the data interface and, accordingly, the incoming parameter is not optimized.
In some embodiments, for the fourth interface type, the data acquisition protocol supported by the data interface transmits the data whose update time is the same as the incoming parameter. Because the update time of data in the data source system is unique, the data interface transmits exactly the data of the designated time on each call, so the integrity and consistency of the data are already ensured. No reliability enhancement policy is configured for the data interface and, accordingly, the incoming parameter is not optimized.
203. And calling the data interface according to the optimized input parameters to execute the data acquisition task.
In some embodiments, if problems such as a communication failure or a network interruption occur while the data source system is updating its data, the server may acquire abnormal data. Accordingly, after the data is acquired, the method further includes performing a preliminary check on the acquired data to determine whether it contains abnormal conditions such as an empty primary key. If it does, either the incoming parameter is adjusted according to the content of the increment field in the abnormal data to obtain a new incoming parameter, or the content of the increment field in the abnormal data is used directly as the new incoming parameter. A new data acquisition task is then generated based on the new incoming parameter, and the data is re-acquired from the data source system by executing the new task, so as to complete the abnormal data.
For example, for the second interface type, define three time points T_A, T_B, and T_C, where T_B lies between T_A and T_C. The incoming parameter of the data acquisition task is the time period with T_A as the start time point and T_C as the end time point, so the task acquires the data updated by the data source system in the period from T_A to T_C.
After the data is acquired, if the data whose update time falls in the period from T_B to T_C has abnormal conditions such as an empty primary key, the start time point T_A in the incoming parameter is adjusted to T_B while the end time point T_C is kept unchanged, which gives the new incoming parameter. A new data acquisition task is generated based on the new incoming parameter, and the data updated in the period from T_B to T_C is re-acquired from the data source system by executing the new task, so as to complete the abnormal data.
If only the data updated at the time point T_B has abnormal conditions such as an empty primary key, the incoming parameter of the data acquisition task is adjusted to the time point T_B. A new data acquisition task is generated based on the new incoming parameter, and the data updated at T_B is re-acquired from the data source system by executing the new task, so as to complete the abnormal data. Note that in this case the interface type of the data interface used for the transmission changes from the second interface type to the fourth interface type.
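The derivation of the new incoming parameter from the abnormal records can be sketched as below. The `Record` class, the helper names, and the concrete timestamps are hypothetical, and one simplifying assumption goes beyond the text: whenever abnormal records span more than one update time, the new period runs from the earliest abnormal update time to the original end time point.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Sequence, Tuple, Union


@dataclass(frozen=True)
class Record:
    primary_key: Optional[str]
    update_time: datetime  # content of the increment field


def is_abnormal(record: Record) -> bool:
    # The preliminary check in this sketch only looks for an empty primary key.
    return record.primary_key is None or record.primary_key == ""


def recollection_parameter(
    records: Sequence[Record], period_end: datetime
) -> Union[None, datetime, Tuple[datetime, datetime]]:
    """Derive the incoming parameter of the re-collection task.

    Returns None when nothing is abnormal, a single time point when all
    abnormal records share one update time (fourth interface type), and a
    (start, end) period otherwise (second interface type).
    """
    times = sorted({r.update_time for r in records if is_abnormal(r)})
    if not times:
        return None
    if len(times) == 1:
        return times[0]
    return times[0], period_end


# Hypothetical values for T_A < T_B < T_C.
t_a = datetime(2024, 1, 1, 8, 30)
t_b = datetime(2024, 1, 1, 8, 35)
t_c = datetime(2024, 1, 1, 8, 40)
period_case = recollection_parameter(
    [Record("k1", t_a), Record(None, t_b), Record("", t_c)], t_c)
point_case = recollection_parameter(
    [Record("k1", t_a), Record(None, t_b), Record("k3", t_c)], t_c)
```

In the first call the abnormal records span T_B to T_C, so the result is the period (T_B, T_C); in the second only the record at T_B is abnormal, so the result is the single time point T_B.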
204. Based on the interface type of the data interface, perform the corresponding data integrity check on the collected data; if the integrity check passes, execute step 205; if the integrity check fails, re-execute the data acquisition task.
In some embodiments, the step 204 includes obtaining a target data integrity check scheme from a mapping relationship between an interface type and a data integrity check scheme based on the interface type of the data interface, and configuring the target data integrity check scheme for the data interface, where the target data integrity check scheme is a data integrity check scheme corresponding to the interface type.
For different interface types, different data integrity check schemes are configured, and the data integrity check schemes are described below based on the different interface types.
In the first case, the interface type is a third interface type, and the third interface type refers to an interface with a version number as an increment field.
For the third interface type, each piece of data in the data source system has a unique version number, so whether the data is complete can be determined by checking whether any version number is missing. Correspondingly, performing the data integrity check on the collected data based on the interface type of the data interface includes: for the third interface type, after the server collects the data, checking whether any version number is missing from the data. If the version numbers in the data are continuous with none missing, the data is complete, the data integrity check passes, and step 205 is executed. If a version number is missing, data has been lost, the data integrity check fails, and the data acquisition task must be re-executed to collect the data corresponding to the missing version number.
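A minimal sketch of this version-number check is shown below. It assumes that version numbers are integers that increase by one, which the text does not state explicitly; the function names are hypothetical. A gap before the first or after the last collected version cannot be detected this way without knowing the expected range.

```python
from typing import Iterable, List


def missing_versions(collected: Iterable[int]) -> List[int]:
    """Return the version numbers absent between the lowest and highest collected ones."""
    present = set(collected)
    if not present:
        return []
    return [v for v in range(min(present), max(present) + 1) if v not in present]


def version_check_passes(collected: Iterable[int]) -> bool:
    """The integrity check passes when the version numbers are continuous."""
    return not missing_versions(collected)
```

The list returned by `missing_versions` identifies which versions a re-executed data acquisition task would need to collect.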
In the second case, the interface type is a fifth interface type, and the fifth interface type refers to a data interface supporting a check code.
For the fifth interface type, the data in the data source system carries a check code, referred to as the first check code. After the server collects the data, it recalculates the check code of the data, referred to as the second check code, and compares the two to determine whether the data is complete. Correspondingly, performing the data integrity check on the collected data based on the interface type of the data interface includes: for the fifth interface type, after the server collects the data, calculating the second check code of the data. If the second check code is identical to the first check code of the data, the data is complete, the data integrity check passes, and step 205 is executed. If the second check code differs from the first check code, the data is incomplete, the data integrity check fails, and the data acquisition task must be re-executed to re-collect the data whose second check code differs from its first check code.
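The comparison of the two check codes can be sketched as follows. The text does not name a check-code algorithm, so SHA-256 from Python's standard `hashlib` module is used here purely as a stand-in; the tuple layout and the function names are likewise assumptions made for illustration.

```python
import hashlib
from typing import Iterable, List, Tuple


def compute_check_code(payload: bytes) -> str:
    """Second check code, recomputed by the server (SHA-256 is an assumed algorithm)."""
    return hashlib.sha256(payload).hexdigest()


def records_to_recollect(records: Iterable[Tuple[str, bytes, str]]) -> List[str]:
    """Return the ids of records whose recomputed check code differs from the carried one.

    Each record is (record_id, payload, first_check_code).
    """
    return [
        record_id
        for record_id, payload, first_check_code in records
        if compute_check_code(payload) != first_check_code
    ]
```

An empty result means the integrity check passes; otherwise the returned ids are the data to collect again.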
By the method provided in the step 204, different data integrity verification schemes are selected according to the interface type of the data interface, and the data integrity verification is performed on the collected data, so that the data integrity is ensured, the data quality is improved, and the data reliability is enhanced.
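Selecting a check scheme from a mapping keyed by interface type might look like the sketch below. The enum, the mapping name, and the row layout are hypothetical, SHA-256 again stands in for the unspecified check-code algorithm, and treating interface types without a configured scheme as passing is an assumption of this sketch rather than something the text states.

```python
import hashlib
from enum import Enum
from typing import Callable, Dict, List


class InterfaceType(Enum):
    FIRST = 1   # natural time as the increment field
    THIRD = 3   # version number as the increment field
    FIFTH = 5   # data interface supporting a check code


def check_versions(batch: List[dict]) -> bool:
    """Pass when the version numbers in the batch are continuous."""
    versions = sorted({row["version"] for row in batch})
    return all(b - a == 1 for a, b in zip(versions, versions[1:]))


def check_codes(batch: List[dict]) -> bool:
    """Pass when every recomputed check code equals the carried one."""
    return all(
        hashlib.sha256(row["payload"]).hexdigest() == row["check_code"]
        for row in batch
    )


# Mapping between interface type and data integrity check scheme.
CHECK_SCHEMES: Dict[InterfaceType, Callable[[List[dict]], bool]] = {
    InterfaceType.THIRD: check_versions,
    InterfaceType.FIFTH: check_codes,
}


def run_integrity_check(interface_type: InterfaceType, batch: List[dict]) -> bool:
    scheme = CHECK_SCHEMES.get(interface_type)
    if scheme is None:
        # Assumption: no scheme configured means nothing to verify.
        return True
    return scheme(batch)
```

Keeping the schemes in a mapping means a new interface type only needs a new entry, without changes to the calling code.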
205. Compare the collected data with the data in the buffer to remove duplicate data.
When the data acquisition task is executed according to steps 201-203 above, the collected data may contain duplicates, which can reduce the efficiency of subsequent data processing. The collected data is therefore compared with the data in the buffer to remove the duplicate data.
In some embodiments, the buffer temporarily stores all data collected from each data source system. Temporary storage means that the data stored in the buffer is evicted according to an eviction algorithm. In some embodiments, the eviction algorithm follows a first-in-first-out rule, and for each data source system, the data most recently stored in the buffer from that data source system is not evicted until data from that data source system is next stored in the buffer.
In some embodiments, the size of the buffer may be adjusted dynamically according to the amount of data collected in the previous acquisition, and the data stored in the buffer is a subset of the data in the database.
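One way to picture the eviction rule is the sketch below, which evicts whole batches in first-in-first-out order but never evicts the most recent batch of any data source. The class name, the batch granularity, and measuring capacity in rows are all assumptions of this sketch; the text does not specify them.

```python
from collections import deque
from typing import Deque, List, Tuple


class DedupBuffer:
    """FIFO buffer that protects the latest batch of each data source (illustrative only)."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity  # maximum number of rows; may be reassigned later
        self._batches: Deque[Tuple[str, List[dict]]] = deque()  # oldest first

    def size(self) -> int:
        return sum(len(rows) for _, rows in self._batches)

    def rows(self) -> List[dict]:
        return [row for _, rows in self._batches for row in rows]

    def store(self, source: str, rows: List[dict]) -> None:
        self._batches.append((source, list(rows)))
        while self.size() > self.capacity:
            # Index of the most recent batch for each source; these are protected.
            latest = {src: i for i, (src, _) in enumerate(self._batches)}
            protected = set(latest.values())
            victim = next(
                (i for i in range(len(self._batches)) if i not in protected), None
            )
            if victim is None:
                break  # only protected batches remain, so the buffer stays over capacity
            del self._batches[victim]
```

Because `capacity` is a plain attribute, it could be updated after each run to reflect the last acquisition volume, in the spirit of the dynamic sizing described above.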
In some embodiments, the above data deduplication includes two implementations:
In the first implementation, for the third interface type, each piece of data in the data source system has a unique version number, so duplicates can be identified from the version numbers of the collected data. Comparing the collected data with the data in the buffer to remove duplicate data includes: comparing the version numbers of the collected data one by one with the version numbers of the data in the buffer; if a version number in the collected data is the same as a version number in the buffer, removing that duplicate data from the collected data; and if it differs from the version numbers in the buffer, retaining that data.
In the second implementation, for the other interface types, duplicates can be identified from the primary keys of the collected data, where a primary key is a field that uniquely identifies a piece of data. Comparing the collected data with the data in the buffer to remove duplicate data includes: comparing the primary keys of the collected data one by one with the primary keys of the data in the buffer; if a primary key in the collected data is the same as a primary key in the buffer, removing that duplicate data from the collected data; and if it differs from the primary keys in the buffer, retaining that data.
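Both implementations reduce to the same comparison on a different field, as the sketch below illustrates. The function name and the dictionary row layout are hypothetical; a set lookup replaces the one-by-one comparison for brevity but gives the same result.

```python
from typing import List


def deduplicate(collected: List[dict], buffered: List[dict], key_field: str) -> List[dict]:
    """Keep only collected rows whose key does not already appear in the buffer.

    key_field would be the version-number field for the third interface type
    and the primary-key field for the other interface types.
    """
    seen = {row[key_field] for row in buffered}
    return [row for row in collected if row[key_field] not in seen]
```

For example, with buffered versions 1 and 2 and collected versions 2 and 3, only the row with version 3 is retained.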
206. Store the deduplicated data in the database, and store the deduplicated data in the buffer.
In some embodiments, after the collected data is deduplicated, the deduplicated data is stored in the database and also stored in the buffer, in preparation for the next round of deduplication.
In the embodiments of the present application, reliability enhancement policies are automatically configured for data interfaces of different interface types, so that no data is missed when a data acquisition task is executed and the integrity of the data is enhanced. For a data source with multiple copies, the method also keeps the data consistent across those copies. Furthermore, on top of the reliability enhancement policies, a data integrity check is performed on the collected data to further ensure its integrity, and the data is deduplicated to improve data quality. Used together, these measures ensure the reliability of the data.
Fig. 4 is a block diagram of a data acquisition device according to an embodiment of the present application. The apparatus is used for executing the data acquisition method described above, referring to fig. 4, and the apparatus includes:
The input parameter optimization module 401 is configured to optimize, in response to receiving an input parameter of a data acquisition task, the input parameter based on a target reliability enhancement policy of a data interface, where the target reliability enhancement policy is an input parameter processing manner corresponding to an interface type to which the data interface belongs;
a data interface calling module 402, configured to call the data interface according to the optimized input parameter, so as to execute the data acquisition task;
The data processing module 403 is configured to process the collected data, and store the processed data in a database.
In one possible implementation, the apparatus further includes:
The policy configuration module is configured to configure a target reliability enhancement policy for a data interface of a data management platform based on an interface type to which the data interface belongs, where the target reliability enhancement policy is an incoming parameter processing mode corresponding to the interface type to which the data interface belongs.
In one possible implementation, the policy configuration module is configured to: for a first interface type, where the first interface type refers to a data interface that uses natural time as the increment field, if the data acquisition protocol supported by the data interface is to transmit data whose update time is later than the incoming parameter, configure a first reliability enhancement policy for the data interface, the first reliability enhancement policy being to subtract one time unit from the incoming parameter of the data acquisition task;
and for a second interface type, where the second interface type refers to a data interface that uses a start-stop time period as the increment field, if the data acquisition protocol supported by the data interface is to transmit data whose update time falls within the start-stop time period, configure a second reliability enhancement policy for the data interface, the second reliability enhancement policy being to shift forward and backward, respectively, by a specified time step, using the incoming parameter as the baseline.
In one possible implementation, the incoming parameter optimization module is configured to: for the first interface type, if the data acquisition protocol supported by the data interface is to transmit data whose update time is later than the incoming parameter, subtract one time unit from the incoming parameter of the data acquisition task;
and for the second interface type, if the data acquisition protocol supported by the data interface is to transmit data whose update time falls within the start-stop time period, shift forward and backward, respectively, by a specified time step, using the incoming parameter as the baseline.
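The two parameter adjustments can be sketched as below. The function names are hypothetical, one second is an assumed default time unit, and the second policy is read here as moving the start earlier and the end later so that the period widens on both sides; the text only says the shift is forward and backward from the baseline.

```python
from datetime import datetime, timedelta
from typing import Tuple


def optimize_time_point(
    incoming: datetime, unit: timedelta = timedelta(seconds=1)
) -> datetime:
    """First policy: subtract one time unit so data updated exactly at the boundary is not missed."""
    return incoming - unit


def optimize_time_period(
    start: datetime, end: datetime, step: timedelta
) -> Tuple[datetime, datetime]:
    """Second policy (assumed reading): widen the start-stop period by one step on each side."""
    return (start - step, end + step)
```

Widening the request in this way can return data that was already collected, which is why the collected data is later compared with the buffer to remove duplicates.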
In one possible implementation manner, the data processing module is configured to perform a corresponding data integrity check on the collected data based on an interface type of the data interface, perform a data deduplication step if the integrity check is passed, and re-perform the data collection task if the integrity check is not passed;
and comparing the acquired data with the data in the buffer area to remove the repeated data.
In one possible implementation, the apparatus further includes:
the scheme configuration module is used for acquiring a target data integrity check scheme from a mapping relation between the interface type and a data integrity check scheme based on the interface type of the data interface, and configuring the target data integrity check scheme for the data interface, wherein the target data integrity check scheme is the data integrity check scheme corresponding to the interface type.
In one possible implementation manner, the data processing module is configured to, for a third interface type, compare the version number of the collected data with the version number of the data in the buffer area one by one, remove duplicate data in the collected data if the version number of the collected data is the same as the version number of the data in the buffer area, and reserve the data in the collected data if the version number of the collected data is different from the version number of the data in the buffer area;
And for other interface types, comparing the main key of the acquired data with the main key of the data in the cache area one by one, if the main key of the acquired data is the same as the main key of the data in the cache area, removing repeated data in the acquired data, and if the main key of the acquired data is different from the main key of the data in the cache area, reserving the data in the acquired data.
In the embodiments of the present application, reliability enhancement policies are automatically configured for data interfaces of different interface types, so that no data is missed when a data acquisition task is executed and the integrity of the data is enhanced. For a data source with multiple copies, the method also keeps the data consistent across those copies. Furthermore, on top of the reliability enhancement policies, a data integrity check is performed on the collected data to further ensure its integrity, and the data is deduplicated to improve data quality. Used together, these measures ensure the reliability of the data.
It should be noted that, when the data acquisition device provided in the above embodiment runs an application program, the division into the functional modules described above is used only as an example. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to perform all or part of the functions described above. In addition, the data acquisition device and the data acquisition method provided in the foregoing embodiments belong to the same concept; the specific implementation process is detailed in the method embodiments and is not repeated here.
Optionally, the above data collection method may be implemented by an architecture as shown in fig. 5, and fig. 5 is a block diagram of a system architecture of a data collection method according to an embodiment of the present application, including a task scheduling unit 501 and a data collection unit 502, where the data collection unit 502 includes an interface identification unit 503, a reliability enhancement policy application unit 504, a data integrity check unit 505, a data deduplication unit 506, a data cache unit 507, and a data warehouse unit 508.
The task scheduling unit 501 is configured to schedule the data acquisition task according to a fixed scheduling rule, so that the server starts the data acquisition task.
The interface identification unit 503 is configured to identify the interface type of the data interface, where the interface type is determined either by automatic identification or by manual selection; for the specific process, see step 201 above.
The reliability enhancement policy application unit 504 is configured to configure a target reliability enhancement policy for the data interface based on an interface type to which the data interface of the data management platform belongs, optimize an incoming parameter based on the target reliability enhancement policy of the data interface, and call the data interface according to the optimized incoming parameter to execute the data acquisition task, where a specific process is shown in steps 201-203.
The data integrity checking unit 505 is configured to perform corresponding data integrity checking on the collected data based on the interface type of the data interface, if the integrity checking is passed, execute the step 205, and if the integrity checking is not passed, re-execute the data collection task, where the specific process is shown in the step 204.
The data deduplication unit 506 is configured to compare the collected data with the data in the buffer area to remove duplicate data, and the specific process is shown in step 205.
The data cache unit 507 is configured to store the deduplicated data in the buffer, in preparation for the next round of deduplication.
The data warehouse unit 508 is configured to store the deduplicated data in the database.
In the embodiment of the present application, the computer device can be configured as a server, and the server is used as an execution body to implement the technical solution provided in the embodiment of the present application.
Fig. 6 is a schematic structural diagram of a server according to an embodiment of the present application. The server 600 may vary considerably depending on its configuration or performance, and may include one or more processors (Central Processing Units, CPUs) 601 and one or more memories 602, where at least one computer program is stored in the memories 602 and is loaded and executed by the processor 601 to implement the data acquisition method provided in the above method embodiments. The server may also have a wired or wireless network interface, a keyboard, an input/output interface, and other components for implementing the functions of the device, which are not described here.
The processor 601 may include one or more processing cores, for example a 4-core processor or an 8-core processor. The processor 601 may be implemented in at least one of the following hardware forms: DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array). The processor 601 may also include a main processor and a coprocessor: the main processor, also called a CPU (Central Processing Unit), processes data in the wake-up state, and the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, the processor 601 may be integrated with a GPU (Graphics Processing Unit), which renders and draws the content to be displayed on the display screen. In some embodiments, the processor 601 may also include an AI (Artificial Intelligence) processor for handling computing operations related to machine learning.
The memory 602 may include one or more computer-readable storage media, which may be non-transitory. The memory 602 may also include high-speed random access memory as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in memory 602 is used to store at least one computer program for execution by processor 601 to implement the data acquisition method provided by the method embodiments of the present application.
Those skilled in the art will appreciate that the architecture shown in fig. 6 is not limiting as to the computer device 600, and may include more or fewer components than shown, or may combine certain components, or employ a different arrangement of components.
Embodiments of the present application also provide a computer-readable storage medium storing at least one computer program, which is loaded and executed by a processor of a computer device to implement the operations performed by the computer device in the methods of the embodiments described above. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
In some embodiments, a computer program according to an embodiment of the present application may be deployed to be executed on one computer device or on multiple computer devices located at one site or on multiple computer devices distributed across multiple sites and interconnected by a communication network, where the multiple computer devices distributed across multiple sites and interconnected by a communication network may constitute a blockchain system.
Embodiments of the present application also provide a computer program product or computer program comprising computer program code stored in a computer readable storage medium. The computer program code is read from a computer readable storage medium by a processor of a computer device, which executes the computer program code, causing the computer device to perform the data acquisition methods provided in the various alternative implementations described above.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description of the preferred embodiments of the present application is not intended to limit the application, but rather, the application is to be construed as limited to the appended claims.

Claims (10)

1. A method of data acquisition, the method comprising:
In response to receiving an incoming parameter of a data acquisition task, optimizing the incoming parameter based on a target reliability enhancement strategy of a data interface, wherein the target reliability enhancement strategy is an incoming parameter processing mode corresponding to an interface type to which the data interface belongs;
calling the data interface according to the optimized input parameters to execute the data acquisition task;
and processing the acquired data and storing the processed data into a database.
2. The method according to claim 1, wherein the method further comprises:
Based on the interface type of the data interface of the data management platform, configuring a target reliability enhancement strategy for the data interface, wherein the target reliability enhancement strategy is an incoming parameter processing mode corresponding to the interface type of the data interface.
3. The method of claim 2, wherein configuring the target reliability enhancement policy for the data interface based on the interface type to which the data interface of the data management platform belongs comprises:
For a first interface type, where the first interface type refers to a data interface that uses natural time as the increment field, if the data acquisition protocol supported by the data interface is to transmit data whose update time is later than the incoming parameter, configuring a first reliability enhancement policy for the data interface, the first reliability enhancement policy being to subtract one time unit from the incoming parameter of the data acquisition task;
And for a second interface type, where the second interface type refers to a data interface that uses a start-stop time period as the increment field, if the data acquisition protocol supported by the data interface is to transmit data whose update time falls within the start-stop time period, configuring a second reliability enhancement policy for the data interface, the second reliability enhancement policy being to shift forward and backward, respectively, by a specified time step, using the incoming parameter as the baseline.
4. The method of claim 1, wherein the optimizing the incoming parameters based on a target reliability enhancement policy of the data interface in response to receiving the incoming parameters of the data acquisition task comprises:
For the first interface type, if the data acquisition protocol supported by the data interface is to transmit data whose update time is later than the incoming parameter, subtracting one time unit from the incoming parameter of the data acquisition task;
For the second interface type, if the data acquisition protocol supported by the data interface is to transmit data whose update time falls within the start-stop time period, shifting forward and backward, respectively, by a specified time step, using the incoming parameter as the baseline.
5. The method of claim 1, wherein processing the collected data comprises:
Based on the interface type of the data interface, carrying out corresponding data integrity check on the acquired data, if the integrity check is passed, executing a data deduplication step, and if the integrity check is not passed, re-executing the data acquisition task;
and comparing the acquired data with the data in the buffer area to remove the repeated data.
6. The method of claim 5, wherein the method further comprises:
Based on the interface type of the data interface, a target data integrity check scheme is obtained from a mapping relation between the interface type and a data integrity check scheme, and the target data integrity check scheme is configured for the data interface, wherein the target data integrity check scheme is the data integrity check scheme corresponding to the interface type.
7. The method of claim 5, wherein comparing the collected data with the data in the buffer to remove duplicate data comprises:
For a third interface type, the third interface type refers to a data interface taking a version number as an increment field, the version number of the collected data is compared with the version number of the data in the cache area one by one, if the version number of the collected data is the same as the version number of the data in the cache area, repeated data in the collected data are removed, and if the version number of the collected data is different from the version number of the data in the cache area, the data in the collected data are reserved;
And for other interface types, comparing the main key of the acquired data with the main key of the data in the cache area one by one, if the main key of the acquired data is the same as the main key of the data in the cache area, removing repeated data in the acquired data, and if the main key of the acquired data is different from the main key of the data in the cache area, reserving the data in the acquired data.
8. A data acquisition device, the device comprising:
An incoming parameter optimization module, configured to, in response to receiving an incoming parameter of a data acquisition task, optimize the incoming parameter based on a target reliability enhancement strategy of a data interface, wherein the target reliability enhancement strategy is an incoming parameter processing mode corresponding to the interface type to which the data interface belongs;
The data interface calling module is used for calling the data interface according to the optimized input parameters so as to execute the data acquisition task;
the data processing module is used for processing the acquired data and storing the processed data into the database.
9. A computer device, characterized in that it comprises a processor and a memory for storing at least one computer program, which is loaded by the processor and which performs the data acquisition method according to any one of claims 1 to 7.
10. A computer readable storage medium, characterized in that the computer readable storage medium is adapted to store at least one computer program for performing the data acquisition method of any one of claims 1 to 7.
CN202310946867.5A 2023-07-28 2023-07-28 Data collection method, device, computer equipment and storage medium Active CN119441325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310946867.5A CN119441325B (en) 2023-07-28 2023-07-28 Data collection method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310946867.5A CN119441325B (en) 2023-07-28 2023-07-28 Data collection method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN119441325A true CN119441325A (en) 2025-02-14
CN119441325B CN119441325B (en) 2025-08-26

Family

ID=94516779

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310946867.5A Active CN119441325B (en) 2023-07-28 2023-07-28 Data collection method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN119441325B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105681100A (en) * 2016-02-29 2016-06-15 北京英诺威尔科技股份有限公司 Multi-protocol acquiring and scheduling method for comprehensive network management
US9582371B1 (en) * 2015-12-09 2017-02-28 International Business Machines Corporation Balancing latency and consistency requirements during data replication
CN113032221A (en) * 2021-03-30 2021-06-25 深圳红途创程科技有限公司 Data acquisition and transmission method and device, computer equipment and storage medium
CN115914050A (en) * 2022-11-04 2023-04-04 杭州熙羚信息技术有限公司 A method and system for collecting network traffic based on policy delivery

Also Published As

Publication number Publication date
CN119441325B (en) 2025-08-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant