Disclosure of Invention
The invention aims to provide an integrated big data management platform based on life cycle management to solve the problems in the background technology.
The technical scheme of the invention is realized as follows:
According to one aspect of the invention, an integrated big data management platform based on life cycle management is provided.
The integrated big data management platform based on life cycle management comprises:
a metadata management subsystem, used for centralized storage, management and maintenance of the metadata of various types of data;
a data access subsystem, used for accessing data and automatically calculating the partition of each piece of data in combination with the metadata;
a data storage management subsystem, used for persistently storing the data and optimizing storage after the data has been written to storage;
a data retrieval subsystem, used for retrieving data by interacting with the metadata management subsystem according to the data characteristic information that a user wants to query, and finding the data that matches that information;
an operation and maintenance management subsystem, used for visual management of data and for providing the data overview and task overview required by an administrator;
a message layer, comprising message middleware that transfers data in a distributed environment;
resource scheduling, used for allocating, measuring, analyzing and using the various resources reasonably and effectively;
and a security mechanism, used for data security and for authenticating the administrator.
The data access subsystem comprises a data loading submodule, a message queue submodule, a data persistence submodule and a data storage submodule. The data loading submodule supports common network protocols including HTTP, TCP and FTP, as well as message middleware of the message layer such as Kafka and RocketMQ, and can start an HTTP server, a TCP server, an FTP server and the like to accept data from different clients; the message queue submodule is used for high-speed buffering and multi-source aggregation of data; the data persistence submodule consumes data from specified topics of the message queue submodule, supports user-defined business rules for sorting the data into different channels, and finally writes the data to storage; and the data storage submodule is used for storing data.
The data storage submodule comprises a distributed data warehouse, a distributed KV library and a distributed full-text library.
The data storage management subsystem comprises a small file merging submodule, a life cycle management submodule and a hierarchical storage submodule. The small file merging submodule runs merging tasks on small files, so that a plurality of small files are merged into one large file; the life cycle management submodule deletes stored data that has expired and supports automatic data deletion for the distributed data warehouse Hive and the distributed full-text library Elasticsearch; and the hierarchical storage submodule stores the data hierarchically.
The operation and maintenance management subsystem comprises a deployment submodule, a configuration submodule, a management submodule, a monitoring submodule, a data overview submodule and a task monitoring submodule. The deployment submodule is used for service deployment; the configuration submodule is used for service configuration; the management submodule is used for managing nodes and services, such as adding and removing nodes or services online and modifying service configuration online; the monitoring submodule monitors the health of the cluster, comprehensively monitors the configured metrics and the running state of the system, monitors server hardware such as network, memory and disk in real time, and monitors the memory usage and active state of each service in real time; the data overview submodule is used for monitoring the data overview; and the task monitoring submodule is used for monitoring the task overview.
According to another aspect of the present invention, there is provided an integrated big data management method based on lifecycle management, including the steps of:
the data access subsystem receives data to be stored from a user;
the data access subsystem processes the stored data and determines its characteristics;
the data access subsystem stores the data according to those characteristics;
the data storage management subsystem merges stored data files whose size is below a preset small file threshold;
and the data storage management subsystem deletes expired stored data according to the preset retention period of the stored data.
When the data access subsystem processes the stored data and determines its characteristics, metadata information such as the data schema, the partition rule and the storage rule can be configured in advance, and the data persistence submodule automatically calculates the partition of the stored data in combination with the metadata to determine its target partition.
When the data storage management subsystem merges stored data according to the preset small file threshold, the threshold can be configured. The small file merging submodule checks the stored data and creates a merging task for the files below the threshold; the merge is carried out by a Spark job, which finally produces a large file, and the original stored data is deleted with a delay.
When the data storage management subsystem deletes expired stored data according to the preset retention period, the data storage life cycle is defined in the metadata; the life cycle management submodule checks the stored data against it and automatically deletes the stored data that exceeds its life cycle.
Compared with the prior art, the invention has the following beneficial effects:
(1) With the metadata management subsystem at the center, the metadata, partition rules and storage rules of the various types of service data are defined in one place. The data access subsystem, the data storage management subsystem and the data retrieval subsystem all build on this metadata and focus on their own internal logic; the "access, storage, management and use" subsystems do not interact with one another directly, so the platform as a whole achieves high cohesion and low coupling.
(2) Multi-path parallel data access, automatic partition calculation and a reasonable file closing strategy balance file size control against data timeliness and greatly improve data access efficiency; small file merging effectively reduces random IO, improves retrieval efficiency and relieves the pressure on file system metadata management; hierarchical storage places hot and cold data more reasonably, balancing the query efficiency of online services against data storage cost; automatic deletion of expired data effectively reduces storage pressure when the volume of information grows rapidly, making mass data easier to manage; transparent partition pruning based on user-defined partition rules avoids full table scans, improving retrieval efficiency and reducing query response time when queries filter on fields from which partition fields are implicitly derived; and visual deployment, configuration and monitoring lower the technical threshold for administrators managing large-scale data clusters and greatly improve working efficiency.
Detailed Description
The invention is further described with reference to the following drawings and detailed description:
According to an embodiment of the invention, an integrated big data management platform based on life cycle management is provided.
As shown in fig. 1 to 4, the lifecycle management-based integrated big data management platform includes:
a metadata management subsystem, used for centralized storage, management and maintenance of the metadata of various types of data;
a data access subsystem, used for accessing data and automatically calculating the partition of each piece of data in combination with the metadata;
a data storage management subsystem, used for persistently storing the data and optimizing storage after the data has been written to storage;
a data retrieval subsystem, used for retrieving data by interacting with the metadata management subsystem according to the data characteristic information that a user wants to query, and finding the data that matches that information;
an operation and maintenance management subsystem, used for visual management of data and for providing the data overview and task overview required by an administrator;
a message layer, comprising message middleware that transfers data in a distributed environment;
resource scheduling, used for allocating, measuring, analyzing and using the various resources reasonably and effectively;
and a security mechanism, used for data security and for authenticating the administrator.
As shown in fig. 2, in the above scheme, the metadata management subsystem is configured to perform centralized storage, management and maintenance of the metadata of various types of data. The metadata includes information such as the data schema, the partition rule and the storage rule. The partition rule is user-defined and may be an equal-value partition, a range partition, a hash partition, a time partition and the like, so that it can flexibly satisfy multiple service scenarios.
It should be emphasized that the metadata management subsystem is the unified metadata management center: it stores, manages and maintains the metadata of all types of data managed in the integrated big data platform. Other subsystems only need to interact with the metadata management subsystem to identify the schema of the data, the format of the data, where the data is stored, how the data is partitioned, how long the data is retained and the like; they do not need to interact with upstream or downstream subsystems, which achieves decoupling.
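As a minimal illustrative sketch (not the platform's actual data model), a metadata record of the kind described above could be represented in Python as follows. The class names `PartitionRule`, `TableMetadata` and `MetadataCatalog`, their fields, and the example table are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict

# Partition rule kinds named in the description (hypothetical identifiers).
PARTITION_KINDS = {"equal", "range", "hash", "time"}


@dataclass(frozen=True)
class PartitionRule:
    kind: str                      # one of PARTITION_KINDS
    source_field: str              # data field the partition value is derived from
    params: Dict[str, object] = field(default_factory=dict)


@dataclass(frozen=True)
class TableMetadata:
    name: str
    schema: Dict[str, str]         # column name -> column type
    partition_rule: PartitionRule
    storage_target: str            # assumed labels, e.g. "warehouse", "kv", "fulltext"
    retention_days: int            # how long the data is retained


class MetadataCatalog:
    """In-memory stand-in for the unified metadata management center."""

    def __init__(self) -> None:
        self._tables: Dict[str, TableMetadata] = {}

    def register(self, meta: TableMetadata) -> None:
        # Reject rules that the other subsystems could not interpret.
        if meta.partition_rule.kind not in PARTITION_KINDS:
            raise ValueError(f"unknown partition kind: {meta.partition_rule.kind}")
        if meta.partition_rule.source_field not in meta.schema:
            raise ValueError("partition source field is not in the schema")
        self._tables[meta.name] = meta

    def lookup(self, name: str) -> TableMetadata:
        return self._tables[name]


catalog = MetadataCatalog()
catalog.register(
    TableMetadata(
        name="access_log",
        schema={"event_time": "timestamp", "user_id": "string"},
        partition_rule=PartitionRule("time", "event_time", {"format": "%Y-%m-%d"}),
        storage_target="warehouse",
        retention_days=30,
    )
)
```

In this sketch every other subsystem would call only `lookup`, which mirrors the idea that subsystems consult the metadata center rather than each other.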
As shown in fig. 2-3, correspondingly, in the above scheme, the data access subsystem includes a data loading submodule, a message queue submodule, a data persistence submodule and a data storage submodule. The data loading submodule supports common network protocols including HTTP, TCP and FTP, as well as message middleware of the message layer such as Kafka and RocketMQ, and can start an HTTP server, a TCP server and an FTP server to accept data from different clients; the message queue submodule is used for high-speed buffering and multi-source aggregation of data; the data persistence submodule consumes data from specified topics of the message queue submodule, supports user-defined business rules for sorting the data into different channels, and finally writes the data to storage; and the data storage submodule is used for storing data. The data storage submodule comprises a distributed data warehouse, a distributed KV library and a distributed full-text library.
The data access subsystem is responsible for accessing data and automatically calculating partitions in combination with the metadata. In practice, it identifies the partition rule from the metadata, maps each piece of data to its actual target partition according to the partition calculation logic, and stores the data in that partition. Correct partition calculation is the basis on which expired data is deleted and on which the data retrieval subsystem performs transparent partition pruning.
In addition, to meet high-availability scenarios, the data access subsystem provides a load balancing module for the HTTP service, with fault detection and automatic switchover, which ensures that the access of important data is not interrupted by a single point of failure.
In addition, the data loading submodule internally provides user authentication, permission verification and the like to meet scenarios with high security requirements, and also provides data validation and data merging. Multi-path parallel loading can be started within a single instance to improve data access efficiency. Internally, the data is sorted according to user-defined rules, the sorted data is sent to different channels, and the data is finally imported into the specified topic of the message queue.
In addition, the data persistence submodule consumes data from specified topics of the message queue submodule and supports user-defined business rules for sorting the data into different channels. It is also responsible for finally writing the data to storage, which includes data deserialization, partition calculation according to the partition rule, partition file selection, file writing and the like. According to the user's configuration, data can be written to the distributed data warehouse, the distributed KV library or the distributed full-text library. To keep data access efficient, several data files can be open simultaneously under the same partition, and the data persistence submodule decides when to close a file according to the file size and the idle waiting time. The user can tune these two parameters to the characteristics of the online service, so that the size of the written files is kept reasonable while the timeliness of data queries is preserved.
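The file closing decision described above depends on two parameters, file size and idle waiting time. The following Python sketch illustrates one way such a policy could look; the names `OpenFile`, `should_close`, `max_bytes` and `max_idle_seconds`, and the example values, are assumptions made for illustration and do not come from the platform itself.

```python
from dataclasses import dataclass


@dataclass
class OpenFile:
    path: str
    bytes_written: int     # bytes written to the file so far
    last_write_ts: float   # time of the most recent write, in seconds


def should_close(f: OpenFile, now: float, max_bytes: int, max_idle_seconds: float) -> bool:
    """Close a file once it is large enough or has been idle for too long."""
    too_large = f.bytes_written >= max_bytes
    too_idle = (now - f.last_write_ts) >= max_idle_seconds
    return too_large or too_idle


# Hypothetical settings: close at 128 MiB or after 60 seconds without writes.
MAX_BYTES = 128 * 1024 * 1024
MAX_IDLE = 60.0

busy_small = OpenFile("part-0001", bytes_written=10_000, last_write_ts=995.0)
idle_small = OpenFile("part-0002", bytes_written=10_000, last_write_ts=900.0)
busy_large = OpenFile("part-0003", bytes_written=MAX_BYTES, last_write_ts=999.0)
```

A larger size limit yields fewer, bigger files, while a shorter idle limit makes newly written data visible to queries sooner; this is the trade-off between file size and timeliness mentioned above.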
As shown in fig. 2 and fig. 4, correspondingly, in the above scheme, the data storage management subsystem includes a small file merging submodule, a life cycle management submodule and a hierarchical storage submodule. The small file merging submodule runs merging tasks on small files, so that a plurality of small files are merged into one large file; the life cycle management submodule deletes stored data that has expired and supports automatic data deletion for the distributed data warehouse Hive and the distributed full-text library Elasticsearch; and the hierarchical storage submodule is responsible for hierarchical storage of data.
The small file merging submodule relies on the Spark distributed computing engine and automatically identifies small files based on the metadata and the file system (a file whose size is below a specified threshold is called a small file; the threshold is user-configurable). Several small files are organized into one merging task; after the task is submitted, the files are merged by a Spark job, and each merging task finally produces one large file. The corresponding small files are then automatically moved to the recycle directory.
Small file merging has three features: first, the merging strategy is configured through metadata, which makes it flexible and convenient; second, the metadata is updated after merging, so that data is not read twice; third, the data from before the merge is deleted with a delay, so that running tasks do not encounter read errors. With this approach, the shortest merging period can reach the level of minutes.
The life cycle management submodule supports automatic deletion of data in the distributed data warehouse Hive and the distributed full-text library Elasticsearch; that is, only data from the most recent period is kept in the big data platform and expired data is deleted. Different retention periods can be specified for different tables, and the time granularity can reach the day level.
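Day-level expiry as described above can be sketched as a comparison between each partition's day and a cutoff derived from the table's retention period. The function name `expired_partitions`, the tables, and the retention values below are hypothetical, and the actual deletion in Hive or Elasticsearch is deliberately left out.

```python
from datetime import date, timedelta
from typing import Iterable, List


def expired_partitions(partition_days: Iterable[date], retention_days: int, today: date) -> List[date]:
    """Return the partition days that are more than retention_days older than today."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(d for d in partition_days if d < cutoff)


# Different tables may carry different retention periods (assumed values).
retention_by_table = {"access_log": 3, "audit_log": 365}

partitions = [date(2024, 1, 5), date(2024, 1, 6), date(2024, 1, 7), date(2024, 1, 10)]
today = date(2024, 1, 10)

to_delete = expired_partitions(partitions, retention_by_table["access_log"], today)
```

Because only whole-day partitions are compared, the sketch operates at day granularity, matching the granularity stated in the description.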
As shown in fig. 2, in the above solution, the data retrieval subsystem provides structured retrieval and interacts with clients through standard SQL syntax and the standard JDBC interface. It has two working modes, ad hoc query and offline retrieval, and is optimized in both task scheduling and task execution. This not only improves query efficiency but also keeps the system running stably while preserving as much concurrency as possible.
Based on the user-defined partition rules, implicit partition conditions are automatically identified during retrieval and converted into standard partition filters, which effectively reduces the amount of data scanned by a single query. Through mechanisms such as concurrency control and priority control, the data retrieval subsystem keeps the system stable and schedules tasks by priority under high concurrency.
The main optimization of the data retrieval subsystem is transparent partition pruning. In the SQL execution planning stage, the subsystem automatically identifies implicit partition conditions in the SQL statement; that is, the user does not need to specify partition fields when querying. Based on the user-defined partition rules in the metadata, the subsystem checks whether the filter conditions contain a field from which a partition field of the table is derived. If so, the condition values and comparison operators are automatically converted into standard partition filters according to the partition rule, thereby achieving partition pruning. In addition, five priority queues are provided to implement priority scheduling of tasks.
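To illustrate the idea of transparent partition pruning, the sketch below assumes a time partition rule in which a day-level partition key (formatted `YYYY-MM-DD`) is derived from a timestamp field. Filter conditions on the source field are converted into inclusive bounds on the partition key. The function names, the tuple representation of conditions, and the field `event_time` are hypothetical, and a real planner would work on the parsed SQL plan rather than on tuples.

```python
from datetime import datetime
from typing import Any, List, Optional, Tuple

# A filter condition as (field, operator, value); a simplified stand-in for a parsed SQL predicate.
Condition = Tuple[str, str, Any]


def derive_day_partition_bounds(
    conditions: List[Condition], source_field: str
) -> Tuple[Optional[str], Optional[str]]:
    """Turn conditions on the source field into inclusive bounds on the day partition key."""
    lo: Optional[str] = None
    hi: Optional[str] = None
    for field_name, op, value in conditions:
        if field_name != source_field:
            continue  # unrelated filters do not affect pruning
        day = value.strftime("%Y-%m-%d")
        if op in (">", ">=", "="):
            lo = day if lo is None else max(lo, day)
        if op in ("<", "<=", "="):
            hi = day if hi is None else min(hi, day)
    return lo, hi


def select_partitions(all_partitions: List[str], lo: Optional[str], hi: Optional[str]) -> List[str]:
    """Keep only partitions inside the bounds; ISO dates compare correctly as strings."""
    return [p for p in all_partitions
            if (lo is None or p >= lo) and (hi is None or p <= hi)]


conditions = [
    ("event_time", ">=", datetime(2024, 3, 2, 8, 0)),
    ("event_time", "<", datetime(2024, 3, 3, 12, 0)),
    ("user_id", "=", "u42"),
]
lo, hi = derive_day_partition_bounds(conditions, "event_time")
scanned = select_partitions(
    ["2024-03-01", "2024-03-02", "2024-03-03", "2024-03-04"], lo, hi
)
```

The bounds are conservative: they may keep a partition that contains no matching rows, which is safe because the original filter on the source field is still applied to the rows that are scanned.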
As shown in fig. 2, correspondingly, in the above scheme, the operation and maintenance management subsystem includes a deployment submodule, a configuration submodule, a management submodule, a monitoring submodule, a data overview submodule and a task monitoring submodule. The deployment submodule is used for service deployment; the configuration submodule is used for service configuration; the management submodule is used for managing nodes and services, such as adding and removing nodes or services online and modifying service configuration online; the monitoring submodule monitors the health of the cluster, comprehensively monitors the configured metrics and the running state of the system, monitors server hardware such as network, memory and disk in real time, and monitors the memory usage and active state of each service in real time; the data overview submodule is used for monitoring the data overview; and the task monitoring submodule is used for monitoring the task overview.
in addition, the operation and maintenance management subsystem can monitor software and hardware information of the platform, including the state of cluster service, the state of each host of the cluster, the availability and the usage of a CPU, a disk, a memory, a network and the like in a node, and can monitor indexes of disk IO, network IO, data storage and the like of the whole cluster.
According to another aspect of the embodiment of the invention, an integrated big data management method based on life cycle management is provided.
As shown in fig. 5, the lifecycle management-based integrated big data management method includes the following steps:
step S101, the data access subsystem receives data to be stored from a user;
step S103, the data access subsystem processes the stored data and determines its characteristics;
step S105, the data access subsystem stores the data according to those characteristics;
step S107, the data storage management subsystem merges stored data files whose size is below a preset small file threshold;
step S109, the data storage management subsystem deletes expired stored data according to the preset retention period of the stored data.
When the data access subsystem processes the stored data and determines its characteristics, metadata information such as the data schema, the partition rule and the storage rule can be configured in advance, and the data persistence submodule automatically calculates the partition of the stored data in combination with the metadata to determine its target partition.
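The automatic partition calculation can be pictured as a function that takes a record and a partition rule read from the metadata and returns the target partition. The sketch below covers the four rule kinds named in the description; the rule dictionary layout, the partition naming scheme (`r0`, `h3` and so on) and the use of CRC32 for hashing are assumptions chosen for illustration only.

```python
import zlib
from datetime import datetime
from typing import Any, Dict


def compute_partition(record: Dict[str, Any], rule: Dict[str, Any]) -> str:
    """Map one record to its target partition according to a metadata-defined rule."""
    kind = rule["kind"]
    value = record[rule["field"]]
    if kind == "equal":
        # Equal-value partition: one partition per distinct value.
        return str(value)
    if kind == "range":
        # Range partition: bounds are ascending, exclusive upper limits.
        for i, upper in enumerate(rule["bounds"]):
            if value < upper:
                return f"r{i}"
        return f"r{len(rule['bounds'])}"
    if kind == "hash":
        # Hash partition: CRC32 is stable across processes, unlike Python's built-in hash().
        bucket = zlib.crc32(str(value).encode("utf-8")) % rule["buckets"]
        return f"h{bucket}"
    if kind == "time":
        # Time partition: format the timestamp down to the configured granularity.
        return value.strftime(rule["format"])
    raise ValueError(f"unknown partition kind: {kind}")


record = {"region": "east", "amount": 250, "user_id": "u42",
          "event_time": datetime(2024, 3, 2, 8, 30)}

by_region = compute_partition(record, {"kind": "equal", "field": "region"})
by_amount = compute_partition(record, {"kind": "range", "field": "amount", "bounds": [100, 1000]})
by_user = compute_partition(record, {"kind": "hash", "field": "user_id", "buckets": 8})
by_day = compute_partition(record, {"kind": "time", "field": "event_time", "format": "%Y-%m-%d"})
```

Keeping the calculation a pure function of the record and the rule means the same rule can later be reused at query time to work out which partitions a filter condition touches.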
When the data storage management subsystem merges stored data according to the preset small file threshold, the threshold can be configured. The small file merging submodule checks the stored data and creates a merging task for the files below the threshold; the merge is carried out by a Spark job, which finally produces a large file, and the original stored data is deleted with a delay.
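Before any Spark job runs, the small files have to be identified and grouped into merging tasks. The sketch below shows only that planning step, in plain Python; the Spark job itself and the delayed deletion are not shown. The function name `plan_merge_tasks` and the `target_size` parameter (the approximate size of each merged file) are assumptions that do not appear in the description.

```python
from typing import List, Tuple


def plan_merge_tasks(
    files: List[Tuple[str, int]], small_threshold: int, target_size: int
) -> List[List[str]]:
    """Group files smaller than small_threshold into tasks of roughly target_size bytes."""
    small = sorted((f for f in files if f[1] < small_threshold), key=lambda f: f[0])
    tasks: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for path, size in small:
        # Start a new task when adding this file would exceed the target size.
        if current and current_size + size > target_size:
            tasks.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        tasks.append(current)
    # A task with a single file gains nothing from merging, so it is dropped.
    return [t for t in tasks if len(t) > 1]


files = [("a", 10), ("b", 20), ("c", 500), ("d", 30), ("e", 40)]
tasks = plan_merge_tasks(files, small_threshold=100, target_size=60)
```

Each returned task lists the small files that one merge job would combine into a single large file; the original files would be removed only after a delay, so that readers already using them are not disrupted.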
When the data storage management subsystem deletes expired stored data according to the preset retention period, the data storage life cycle is defined in the metadata; the life cycle management submodule checks the stored data against it and automatically deletes the stored data that exceeds its life cycle.
In summary, according to the technical scheme of the present invention, the metadata management subsystem sits at the center and defines the metadata, partition rules and storage rules of the various types of service data. The data access subsystem, the data storage management subsystem and the data retrieval subsystem all build on this metadata and focus on their own internal logic; the "access, storage, management and use" subsystems do not interact with one another directly, so the platform as a whole achieves high cohesion and low coupling. Multi-path parallel data access, automatic partition calculation and a reasonable file closing strategy balance file size control against data timeliness and greatly improve data access efficiency; small file merging effectively reduces random IO, improves retrieval efficiency and relieves the pressure on file system metadata management; hierarchical storage places hot and cold data more reasonably, balancing the query efficiency of online services against data storage cost; automatic deletion of expired data effectively reduces storage pressure when the volume of information grows rapidly, making mass data easier to manage; transparent partition pruning based on user-defined partition rules avoids full table scans, improving retrieval efficiency and reducing query response time when queries filter on fields from which partition fields are implicitly derived; and visual deployment, configuration and monitoring lower the technical threshold for administrators managing large-scale data clusters and greatly improve working efficiency.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.