CN113127462A - Integrated big data management platform based on life cycle management - Google Patents

Info

Publication number
CN113127462A
CN113127462A CN202010030011.XA CN202010030011A CN113127462A CN 113127462 A CN113127462 A CN 113127462A CN 202010030011 A CN202010030011 A CN 202010030011A CN 113127462 A CN113127462 A CN 113127462A
Authority
CN
China
Prior art keywords
data
management
submodule
storage
life cycle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010030011.XA
Other languages
Chinese (zh)
Inventor
苏志坚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lianyang Guorong Beijing Technology Co ltd
Original Assignee
Lianyang Guorong Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lianyang Guorong Beijing Technology Co ltd filed Critical Lianyang Guorong Beijing Technology Co ltd
Priority to CN202010030011.XA priority Critical patent/CN113127462A/en
Publication of CN113127462A publication Critical patent/CN113127462A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2219 Large Object storage; Management thereof
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G06F16/24532 Query optimisation of parallel queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Quality & Reliability (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an integrated big data management platform based on life cycle management, comprising: a metadata management subsystem, used for centralized storage, management and maintenance of the metadata of various types of data; a data access subsystem, used for data access; a data storage management subsystem, used for persistent storage of data; a data retrieval subsystem, used for data retrieval, which interacts with the metadata management subsystem according to the data feature information the user wants to query and finds the data matching that feature information; an operation and maintenance management subsystem, used for visual management of data and for providing the data overview and task overview required by an administrator; a message layer, consisting of message middleware, used for providing a distributed environment for data transfer; resource scheduling, used for the reasonable and effective adjustment, measurement, analysis and use of various resources; and a security mechanism, used for data security. Beneficial effect: the platform as a whole achieves "high cohesion and low coupling".

Figure 202010030011

Description

Integrated big data management platform based on life cycle management
Technical Field
The invention relates to the technical field of big data, in particular to an integrated big data management platform based on life cycle management.
Background
Big data refers to very large data sets collected from many sources and in many forms, often in real time. Such data is closely tied to increasingly widespread online behavior, is collected by the relevant departments and enterprises, and reflects the real intentions and preferences of its producers; much of it has a non-traditional structure and meaning.
With the rapid development of society, data is expanding rapidly and its order of magnitude keeps growing. It is characterized by diverse data types, large volume, low value density, high velocity and high timeliness, and existing technical architectures and approaches cannot process such massive data efficiently. Achieving efficient access, storage, management and retrieval of massive data has become a major technical challenge in the development and transformation of enterprise or organizational services. An integrated big data management platform is therefore needed that, based on the data life cycle, can efficiently access, store, manage and use data.
No effective solution to these problems in the related art has yet been proposed.
Disclosure of Invention
The invention aims to provide an integrated big data management platform based on life cycle management, so as to solve the problems described in the Background.
The technical scheme of the invention is realized as follows:
According to one aspect of the invention, an integrated big data management platform based on life cycle management is provided.
The integrated big data management platform based on life cycle management comprises:
a metadata management subsystem, used for centralized storage, management and maintenance of the metadata of various types of data;
a data access subsystem, used for accessing (ingesting) data and automatically computing the partition of the data in combination with the metadata;
a data storage management subsystem, used for persistent storage of the data and for optimizing storage after the data has landed (been written to storage);
a data retrieval subsystem, used for retrieving data, which interacts with the metadata management subsystem according to the data feature information the user wants to query and finds the data matching that feature information;
an operation and maintenance management subsystem, used for visual management of data and for providing the data overview and task overview required by an administrator;
a message layer, consisting of message middleware, used for providing a distributed environment for data transfer;
resource scheduling, used for the reasonable and effective adjustment, measurement, analysis and use of various resources;
and a security mechanism, used for data security and for authenticating the administrator.
The data access subsystem comprises a data loading submodule, a message queue submodule, a data persistence submodule and a data storage submodule. The data loading submodule supports common network protocols, including HTTP, TCP and FTP, as well as message-layer middleware such as Kafka and RocketMQ; it can start an HTTP server, a TCP server, an FTP server and the like to receive data from different clients. The message queue submodule provides high-speed buffering and multi-source aggregation of data. The data persistence submodule consumes data from specified topics of the message queue submodule, lets users sort the data into different channels according to business rules, and performs the final landing (writing to storage) of the data. The data storage submodule stores the data.
The data storage submodule comprises a distributed data warehouse, a distributed key-value (KV) store and a distributed full-text library.
The data storage management subsystem comprises a small file merging submodule, a life cycle management submodule and a hierarchical storage submodule. The small file merging submodule runs merge tasks that combine multiple small files into one large file. The life cycle management submodule deletes stored data that has expired and supports automatic data deletion for the distributed data warehouse Hive and the distributed full-text library Elasticsearch. The hierarchical storage submodule stores data in tiers.
The operation and maintenance management subsystem comprises a deployment submodule, a configuration submodule, a management submodule, a monitoring submodule, a data overview submodule and a task monitoring submodule. The deployment submodule is used for service deployment. The configuration submodule is used for service configuration. The management submodule manages nodes and services, for example adding and removing nodes or services online and modifying service configuration online. The monitoring submodule monitors the health of the cluster: it comprehensively monitors the configured metrics and the running state of the system, monitors the network, memory, disks and the like of the hardware servers in real time, and monitors the memory usage and liveness of services in real time. The data overview submodule monitors the overall picture of the data. The task monitoring submodule monitors the task overview.
According to another aspect of the present invention, there is provided an integrated big data management method based on life cycle management, including the following steps:
the data access subsystem receives the data to be stored from the user;
the data access subsystem processes the stored data and determines its characteristics;
the data access subsystem stores the data according to those characteristics;
according to a preset small file threshold, the data storage management subsystem merges stored data whose size is below the threshold;
and, according to the preset retention period of the stored data, the data storage management subsystem deletes stored data that has expired.
When the data access subsystem processes the stored data and determines its characteristics, metadata such as the data schema, partition rules and storage rules can be configured in advance; the data persistence submodule then automatically computes the partition of the stored data in combination with the metadata to determine its target partition.
When the data storage management subsystem merges stored data according to the preset small file threshold, the threshold is configurable. The small file merging submodule evaluates the stored data and creates merge tasks for data below the threshold; the merge is carried out by a Spark job, which finally generates a large file, and the original stored data is deleted after a delay.
When the data storage management subsystem deletes expired stored data according to the preset retention period, the storage life cycle of the data is defined in the metadata; the stored data is checked against it, and the life cycle management submodule automatically deletes stored data that has exceeded its life cycle.
Compared with the prior art, the invention has the following beneficial effects:
(1) The metadata management subsystem serves as the center and defines the metadata, partition rules and storage rules of the various kinds of service data. The data access subsystem, the data storage management subsystem and the data retrieval subsystem all build on this metadata and focus on their own internal logic; the access, storage, management and use subsystems do not interact with one another directly, so the platform as a whole achieves "high cohesion and low coupling".
(2) Multi-path parallel data access, automatic partition calculation and a reasonable file closing strategy balance file size control against data timeliness and greatly improve data access efficiency. Merging small files effectively reduces random IO, improves retrieval efficiency and relieves the pressure on file system metadata management. Hierarchical storage places hot and cold data more sensibly, balancing the query efficiency of online services against storage cost. Automatic deletion of expired data effectively reduces storage pressure when information is expanding rapidly and makes the management of massive data easier to handle. Transparent partition pruning based on user-defined partition rules effectively reduces full-table scanning, improves retrieval efficiency and reduces query response time in scenarios with many filter conditions on implicit partition fields. Visual deployment, configuration and monitoring lower the technical threshold for administrators managing large data clusters and greatly improve working efficiency.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a block diagram of an integrated big data management platform based on life cycle management according to an embodiment of the present invention;
FIG. 2 is a logical architecture diagram of an integrated big data management platform based on life cycle management according to an embodiment of the present invention;
FIG. 3 is a logical architecture diagram showing the implementation principle of the data access subsystem in an integrated big data management platform based on life cycle management according to an embodiment of the present invention;
FIG. 4 is a logical architecture diagram showing the implementation principle of small file merging and expired data deletion in an integrated big data management platform based on life cycle management according to an embodiment of the present invention;
FIG. 5 is a flowchart of an integrated big data management method based on life cycle management according to an embodiment of the present invention.
Detailed Description
The invention is further described below with reference to the drawings and the detailed description.
According to an embodiment of the invention, an integrated big data management platform based on life cycle management is provided.
As shown in fig. 1 to 4, the integrated big data management platform based on life cycle management includes:
a metadata management subsystem, used for centralized storage, management and maintenance of the metadata of various types of data;
a data access subsystem, used for accessing (ingesting) data and automatically computing the partition of the data in combination with the metadata;
a data storage management subsystem, used for persistent storage of the data and for optimizing storage after the data has landed (been written to storage);
a data retrieval subsystem, used for retrieving data, which interacts with the metadata management subsystem according to the data feature information the user wants to query and finds the data matching that feature information;
an operation and maintenance management subsystem, used for visual management of data and for providing the data overview and task overview required by an administrator;
a message layer, consisting of message middleware, used for providing a distributed environment for data transfer;
resource scheduling, used for the reasonable and effective adjustment, measurement, analysis and use of various resources;
and a security mechanism, used for data security and for authenticating the administrator.
As shown in fig. 2, in the above scheme the metadata management subsystem performs centralized storage, management and maintenance of the metadata of various types of data. The metadata includes information such as the data schema, partition rules and storage rules. Partition rules are user-defined and include equality partitioning, range partitioning, hash partitioning, time partitioning and the like, so that a variety of service scenarios can be met flexibly.
It should be emphasized that the metadata management subsystem is a unified metadata management center: it stores, manages and maintains the metadata of all types of data managed by the integrated big data platform. Other subsystems only need to interact with the metadata management subsystem to learn the schema and format of the data, where it is stored, how it is partitioned, how long it is retained and so on. They do not need to interact with upstream or downstream subsystems, which achieves decoupling.
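The four partition rule types named above can be illustrated with a minimal Python sketch. It is not taken from the patent: the `PartitionRule` class, its fields, the `compute_partition` function and the partition naming scheme (`r0`, `h3`, `%Y%m%d`) are all hypothetical, and the time rule assumes the source field holds epoch seconds interpreted in UTC.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import datetime, timezone
from zlib import crc32


@dataclass(frozen=True)
class PartitionRule:
    """Hypothetical metadata entry describing how a table is partitioned."""
    kind: str                    # "equality", "range", "hash" or "time"
    source_field: str            # record field the partition is derived from
    bounds: tuple = ()           # sorted boundaries, used by "range" only
    buckets: int = 1             # number of buckets, used by "hash" only
    time_format: str = "%Y%m%d"  # partition name format, used by "time" only


def compute_partition(rule: PartitionRule, record: dict) -> str:
    """Map one record to its target partition according to the rule."""
    value = record[rule.source_field]
    if rule.kind == "equality":
        # One partition per distinct value.
        return str(value)
    if rule.kind == "range":
        # Index of the first boundary greater than the value.
        return f"r{bisect_right(rule.bounds, value)}"
    if rule.kind == "hash":
        # crc32 is stable across processes, unlike the built-in hash().
        return f"h{crc32(str(value).encode('utf-8')) % rule.buckets}"
    if rule.kind == "time":
        # Assumption: value is epoch seconds, bucketed by UTC calendar day.
        moment = datetime.fromtimestamp(value, tz=timezone.utc)
        return moment.strftime(rule.time_format)
    raise ValueError(f"unknown partition rule kind: {rule.kind}")
```

Because the rule lives in the metadata rather than in the ingesting code, any subsystem that reads the same rule would arrive at the same target partition for a given record, which is the decoupling the description relies on.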
As shown in fig. 2-3, correspondingly, in the above scheme the data access subsystem includes a data loading submodule, a message queue submodule, a data persistence submodule and a data storage submodule. The data loading submodule supports common network protocols, including HTTP, TCP and FTP, as well as message-layer middleware such as Kafka and RocketMQ; it can start an HTTP server, a TCP server and an FTP server to receive data from different clients. The message queue submodule provides high-speed buffering and multi-source aggregation of data. The data persistence submodule consumes data from specified topics of the message queue submodule, lets users sort the data into different channels according to business rules, and performs the final landing (writing to storage) of the data. The data storage submodule stores the data and comprises a distributed data warehouse, a distributed key-value (KV) store and a distributed full-text library.
The data access subsystem is responsible for accessing data and automatically computes the partition of the data in combination with the metadata. In actual use it reads the partition rules from the metadata, maps each record to its actual target partition according to the partition calculation logic, and stores the record in that partition. Correct partition calculation is the basis on which expired data is deleted and on which the data retrieval subsystem performs transparent partition pruning.
In addition, to support high-availability scenarios, the data access subsystem provides a load balancing module for the HTTP service. The module offers fault detection and automatic failover, so that the ingestion of important data is not interrupted by a single point of failure.
In addition, the data loading submodule has built-in functions such as user authentication and permission verification, which suit application scenarios with high security requirements, as well as data validation and data merging functions. A single instance can start multi-path parallel loading, which improves data access efficiency. Internally, the submodule sorts the data according to user-defined rules, sends the sorted data to different channels, and finally imports it into the specified topic of the message queue.
In addition, the data persistence submodule consumes data from specified topics of the message queue submodule and lets users sort the data into different channels according to business rules. It also performs the final landing of the data, which includes deserializing the data, computing the partition according to the partition rules, selecting the partition file and writing the file. Depending on the user's configuration, data can land in the distributed data warehouse, the distributed KV store or the distributed full-text library. To keep data access efficient, several data files can be open at the same time under one partition. The data persistence submodule decides when to close a file according to the file size and the idle waiting time, and the user can tune these two parameters to the characteristics of the online service, so that the size of the landed files is kept under reasonable control while the timeliness of data queries is preserved.
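The file closing decision described above depends on two tunable parameters, file size and idle waiting time. A minimal sketch of such a policy follows. The names `OpenFile`, `ClosePolicy` and `should_close`, and the use of plain timestamps in seconds, are illustrative assumptions and not details given in the patent.

```python
from dataclasses import dataclass


@dataclass
class OpenFile:
    """Hypothetical bookkeeping record for one open data file in a partition."""
    path: str
    size_bytes: int = 0
    last_write_ts: float = 0.0   # time of the most recent write, in seconds


@dataclass(frozen=True)
class ClosePolicy:
    """The two user-tunable parameters named in the description."""
    max_size_bytes: int
    max_idle_seconds: float

    def should_close(self, f: OpenFile, now: float) -> bool:
        # Close when the file is large enough to limit small-file growth,
        # or when it has been idle long enough that waiting would only
        # delay the moment its data becomes visible to queries.
        too_big = f.size_bytes >= self.max_size_bytes
        too_idle = (now - f.last_write_ts) >= self.max_idle_seconds
        return too_big or too_idle
```

A larger size limit yields fewer, bigger files but delays visibility on quiet partitions; the idle limit bounds that delay. This is the trade-off between file size control and query timeliness that the description mentions.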
As shown in fig. 2 and fig. 4, correspondingly, in the above scheme the data storage management subsystem includes a small file merging submodule, a life cycle management submodule and a hierarchical storage submodule. The small file merging submodule runs merge tasks that combine multiple small files into one large file. The life cycle management submodule deletes stored data that has expired and supports automatic data deletion for the distributed data warehouse Hive and the distributed full-text library Elasticsearch. The hierarchical storage submodule is responsible for storing data in tiers.
The small file merging submodule relies on the Spark distributed computing engine. Based on the metadata and the file system, it automatically identifies small files, that is, files whose size is below a specified threshold that the user can configure. It groups multiple small files into a merge task; after the task is submitted, a Spark job merges the files, and each merge task finally generates one large file. The corresponding small files are then automatically moved to a recovery directory.
Small file merging has three features. First, the merge strategy is configured through metadata, which makes it flexible and convenient. Second, the metadata is updated after merging, so data is not read twice. Third, the pre-merge data is deleted after a delay, so running tasks do not hit read errors. With this method the shortest merge period can reach the level of minutes.
The life cycle management submodule lets users have data deleted automatically from the distributed data warehouse Hive and the distributed full-text library Elasticsearch: data from the most recent period is kept in the big data platform, and expired data is deleted. Different retention periods can be specified for different tables, with a time granularity down to one day.
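The planning step of small file merging (identify files below the threshold, then group them into merge tasks that each produce one large file) can be sketched as follows. This is only an illustration under assumptions: the function name `plan_merge_tasks` and the `target_size` parameter are hypothetical, the greedy grouping is one possible strategy, and the Spark job that actually rewrites the files is not shown.

```python
def plan_merge_tasks(file_sizes: dict, small_threshold: int, target_size: int) -> list:
    """Group small files of one partition into merge tasks.

    file_sizes maps file path to size in bytes. Files below small_threshold
    are candidates. Candidates are packed greedily, in path order, into tasks
    whose combined size stays within target_size. Each returned task is a
    list of paths that one merge job would rewrite into a single large file.
    """
    candidates = sorted(p for p, size in file_sizes.items() if size < small_threshold)
    tasks, current, current_size = [], [], 0
    for path in candidates:
        size = file_sizes[path]
        if current and current_size + size > target_size:
            tasks.append(current)
            current, current_size = [], 0
        current.append(path)
        current_size += size
    if current:
        tasks.append(current)
    # A task with a single file would merge nothing, so it is dropped.
    return [task for task in tasks if len(task) > 1]
```

After a task completes, the description calls for updating the metadata to point at the new large file and moving the originals to a recovery directory for delayed deletion, so that readers that started before the merge can still finish.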
As shown in fig. 2, in the above solution the data retrieval subsystem provides structured retrieval and interacts with clients through standard SQL syntax and the standard JDBC interface. It has two working modes, ad hoc query and offline retrieval, and is optimized in two respects, task scheduling and task execution. This improves query efficiency and keeps the system running stably while supporting as much concurrency as possible.
Based on the user-defined partition rules, implicit partition conditions are automatically identified during retrieval and converted into standard partition filters, which effectively reduces the amount of data scanned by a single query. Through mechanisms such as concurrency control and priority control, the data retrieval subsystem keeps the system stable and schedules tasks by priority under high concurrency.
The main optimization of the data retrieval subsystem is transparent partition pruning. In the SQL execution planning stage, the subsystem automatically identifies implicit partition conditions in the SQL statement, so the user does not need to specify partition fields when querying. Based on the user-defined partition rules in the metadata, the subsystem checks whether the filter conditions contain the source field from which the table's partition field is derived. If they do, the condition values and operators are automatically converted into standard partition filters according to the partition rules, which achieves partition pruning. The subsystem also provides five priority queues to implement priority scheduling of tasks.
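As an illustration of transparent partition pruning, the sketch below handles a single case: a table whose partition column is derived from a timestamp field by a daily time rule, queried with an inclusive range on that timestamp. The function name, the epoch-seconds representation, UTC days and the `%Y%m%d` partition names are assumptions made for the example; the patent does not specify them, and other rule types or operators would need their own conversions.

```python
from datetime import datetime, timedelta, timezone


def prune_time_partitions(start_ts, end_ts, all_partitions, fmt="%Y%m%d"):
    """Return the partitions that can hold rows with start_ts <= ts <= end_ts.

    The user filters only on the source timestamp field. Because the
    partition column is derived from that field by day, the range can be
    rewritten into an explicit list of day partitions to scan.
    """
    if end_ts < start_ts:
        return []
    # Truncate the lower bound to the start of its UTC day.
    day = datetime.fromtimestamp(start_ts, tz=timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    last = datetime.fromtimestamp(end_ts, tz=timezone.utc)
    wanted = set()
    while day <= last:
        wanted.add(day.strftime(fmt))
        day += timedelta(days=1)
    # Only partitions that actually exist need to be scanned.
    return sorted(p for p in all_partitions if p in wanted)
```

A query planner using such a step would add the returned partition list as a standard partition filter, so partitions outside the requested time range are never opened.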
As shown in fig. 2, correspondingly, in the above scheme the operation and maintenance management subsystem includes a deployment submodule, a configuration submodule, a management submodule, a monitoring submodule, a data overview submodule and a task monitoring submodule. The deployment submodule is used for service deployment. The configuration submodule is used for service configuration. The management submodule manages nodes and services, for example adding and removing nodes or services online and modifying service configuration online. The monitoring submodule monitors the health of the cluster: it comprehensively monitors the configured metrics and the running state of the system, monitors the network, memory, disks and the like of the hardware servers in real time, and monitors the memory usage and liveness of services in real time. The data overview submodule monitors the overall picture of the data. The task monitoring submodule monitors the task overview.
In addition, the operation and maintenance management subsystem can monitor the software and hardware information of the platform, including the state of cluster services, the state of each host in the cluster, and the availability and usage of the CPU, disks, memory, network and the like on each node. It can also monitor metrics such as disk IO, network IO and data storage for the cluster as a whole.
According to another aspect of the embodiment of the invention, an integrated big data management method based on life cycle management is provided.
As shown in fig. 5, the integrated big data management method based on life cycle management includes the following steps:
step S101, the data access subsystem receives the data to be stored from the user;
step S103, the data access subsystem processes the stored data and determines its characteristics;
step S105, the data access subsystem stores the data according to those characteristics;
step S107, according to the preset small file threshold, the data storage management subsystem merges stored data whose size is below the threshold;
step S109, according to the preset retention period of the stored data, the data storage management subsystem deletes stored data that has expired.
When the data access subsystem processes the stored data and determines its characteristics, metadata such as the data schema, partition rules and storage rules can be configured in advance; the data persistence submodule then automatically computes the partition of the stored data in combination with the metadata to determine its target partition.
When the data storage management subsystem merges stored data according to the preset small file threshold, the threshold is configurable. The small file merging submodule evaluates the stored data and creates merge tasks for data below the threshold; the merge is carried out by a Spark job, which finally generates a large file, and the original stored data is deleted after a delay.
When the data storage management subsystem deletes expired stored data according to the preset retention period, the storage life cycle of the data is defined in the metadata; the stored data is checked against it, and the life cycle management submodule automatically deletes stored data that has exceeded its life cycle.
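The expiry check in the last step can be sketched as a selection over day-level partitions. The sketch assumes, purely for illustration, that partitions are named by day in `%Y%m%d` form and that the retention period from the metadata is a number of days; the function name `expired_partitions` is hypothetical, and the actual deletion from Hive or Elasticsearch is not shown.

```python
from datetime import date, datetime, timedelta


def expired_partitions(partitions, retention_days, today, fmt="%Y%m%d"):
    """Select the day partitions that have outlived the retention period.

    partitions is an iterable of partition names such as "20200101".
    A partition is expired when its day is earlier than
    today minus retention_days.
    """
    cutoff = today - timedelta(days=retention_days)
    return sorted(
        p for p in partitions
        if datetime.strptime(p, fmt).date() < cutoff
    )
```

Because each table can carry its own retention period in the metadata, a scheduler would call such a function once per table with that table's value and then drop the returned partitions.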
In summary, in the technical scheme of the present invention the metadata management subsystem serves as the center and defines the metadata, partition rules and storage rules of the various kinds of service data. The data access subsystem, the data storage management subsystem and the data retrieval subsystem all build on this metadata and focus on their own internal logic; the access, storage, management and use subsystems do not interact with one another directly, so the platform as a whole achieves "high cohesion and low coupling". Multi-path parallel data access, automatic partition calculation and a reasonable file closing strategy balance file size control against data timeliness and greatly improve data access efficiency. Merging small files effectively reduces random IO, improves retrieval efficiency and relieves the pressure on file system metadata management. Hierarchical storage places hot and cold data more sensibly, balancing the query efficiency of online services against storage cost. Automatic deletion of expired data effectively reduces storage pressure when information is expanding rapidly and makes the management of massive data easier to handle. Transparent partition pruning based on user-defined partition rules effectively reduces full-table scanning, improves retrieval efficiency and reduces query response time in scenarios with many filter conditions on implicit partition fields. Visual deployment, configuration and monitoring lower the technical threshold for administrators managing large data clusters and greatly improve working efficiency.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. An integrated big data management platform based on life cycle management, characterized by comprising:
a metadata management subsystem, used for centrally storing, managing and maintaining the metadata of various data;
a data access subsystem, used for accessing data and performing automatic partition calculation on the data in combination with the metadata;
a data storage management subsystem, used for persistently storing the data and optimizing the storage after the data has been written to storage;
a data retrieval subsystem, used for retrieving data by interacting with the metadata management subsystem according to the data characteristic information that a user wishes to query, and finding the data that matches the data characteristic information;
an operation and maintenance management subsystem, used for the visual management of data and for providing the data overview and task overview required by an administrator;
a message layer, being message middleware used for providing a distributed environment for data transfer;
resource scheduling, used for reasonably and effectively adjusting, measuring, analyzing and using various resources;
and a security mechanism, used for data security and for authenticating the administrator.
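As an illustration of how a data retrieval subsystem might use partition information held in the metadata to avoid a full-table scan, the following minimal Python sketch keeps only the day partitions that overlap a queried time range. The function name `prune_partitions`, the `YYYYMMDD` partition naming and the use of UTC are assumptions made for illustration; the claims do not specify them.

```python
from datetime import datetime, timezone
from typing import List


def prune_partitions(partitions: List[str],
                     start_ts: float, end_ts: float) -> List[str]:
    """Return the day partitions (named YYYYMMDD) that overlap [start_ts, end_ts].

    The user filters on a timestamp; the mapping from timestamp to
    partition stays hidden from the user (assumed behavior).
    """
    start_day = datetime.fromtimestamp(start_ts, tz=timezone.utc).strftime("%Y%m%d")
    end_day = datetime.fromtimestamp(end_ts, tz=timezone.utc).strftime("%Y%m%d")
    # YYYYMMDD strings sort chronologically, so string comparison is enough.
    return [p for p in partitions if start_day <= p <= end_day]
```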
2. The integrated big data management platform based on life cycle management of claim 1, wherein the data access subsystem comprises a data loading submodule, a message queue submodule, a data persistence submodule and a data storage submodule, wherein,
the data loading submodule supports common network protocols including HTTP, TCP and FTP, also supports message middleware of the message layer such as Kafka and RocketMQ, and can start an HTTP server, a TCP server, an FTP server and the like for accessing data from different clients;
the message queue submodule is used for high-speed buffering and multi-source aggregation of data;
the data persistence submodule is used for consuming data from a specified topic of the message queue submodule, supporting the user in sorting the data into different channels according to business rules, and finally writing the data to storage;
and the data storage submodule is used for storing data.
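The sorting of consumed data into channels according to business rules can be sketched as follows. This is a minimal Python illustration only: the names `sort_into_channels`, `Record` and `Rule`, the first-match policy and the `default` channel are all assumptions, and consumption from Kafka or RocketMQ is replaced here by a plain iterable of records.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Record = Dict[str, object]
Rule = Tuple[str, Callable[[Record], bool]]  # (channel name, predicate)


def sort_into_channels(records: Iterable[Record], rules: List[Rule],
                       default_channel: str = "default") -> Dict[str, List[Record]]:
    """Assign each record to the first channel whose rule matches it."""
    channels: Dict[str, List[Record]] = defaultdict(list)
    for rec in records:
        for name, predicate in rules:
            if predicate(rec):
                channels[name].append(rec)
                break
        else:
            # No business rule matched: keep the record in a fallback channel.
            channels[default_channel].append(rec)
    return dict(channels)
```

Each resulting channel could then be written to storage independently, which is one way in which multi-channel parallel access might be realized.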
3. The integrated big data management platform based on life cycle management according to claim 2, wherein the data storage submodule comprises a distributed data warehouse, a distributed KV library and a distributed full-text library.
4. The integrated big data management platform based on life cycle management of claim 3, wherein the data storage management subsystem comprises a small file merging submodule, a life cycle management submodule and a hierarchical storage submodule, wherein,
the small file merging submodule is used for performing merging tasks on the small files so as to merge a plurality of small files to generate a large file;
the life cycle management submodule is used for deleting expired stored data and supports automatic data deletion for the distributed data warehouse Hive and the distributed full-text library Elasticsearch;
and the hierarchical storage submodule is responsible for hierarchical storage of the data.
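The claim does not state how data is assigned to storage tiers. As one possible illustration, the following Python sketch assigns a partition to a tier by its age; the function name `choose_tier`, the three tier names and the 7-day and 90-day boundaries are assumptions, not values from the patent.

```python
from datetime import date


def choose_tier(partition_day: date, today: date,
                hot_days: int = 7, warm_days: int = 90) -> str:
    """Pick a storage tier from the age of a partition (assumed policy).

    Recent data stays on fast storage for online queries; older data
    moves to cheaper storage to reduce cost.
    """
    age = (today - partition_day).days
    if age <= hot_days:
        return "hot"
    if age <= warm_days:
        return "warm"
    return "cold"
```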
5. The integrated big data management platform based on life cycle management according to claim 4, wherein the operation and maintenance management subsystem comprises a deployment submodule, a configuration submodule, a management submodule, a monitoring submodule, a data full-view submodule and a task monitoring submodule, wherein,
the deployment submodule is used for service deployment;
the configuration submodule is used for service configuration;
the management submodule is used for managing nodes and services, for example adding or removing nodes or services online and modifying service configuration online;
the monitoring submodule is used for monitoring the health of the cluster, comprehensively monitoring the configured indicators and the running state of the system, monitoring the network, memory, disks and the like of the hardware servers in real time, and monitoring the memory usage and liveness of services in real time;
the data full-view submodule is used for monitoring the data overview;
and the task monitoring submodule is used for monitoring the task overview.
6. An integrated big data management method based on life cycle management, which is used for the integrated big data management platform based on life cycle management in claim 5, and comprises the following steps:
the data access subsystem receives data to be stored from a user;
the data access subsystem processes the stored data and determines the characteristics of the stored data;
the data access subsystem stores the data according to the characteristics of the stored data;
the data storage management subsystem merges, according to a preset small file threshold, the stored data that falls below the threshold;
and the data storage management subsystem deletes expired stored data according to a preset storage period of the stored data.
7. The integrated big data management method based on life cycle management as claimed in claim 6, wherein the data access subsystem processing the stored data and determining the characteristics of the stored data comprises:
pre-configuring metadata information such as the data schema, the partition rule and the storage rule;
and the data persistence submodule performing, in combination with the metadata, automatic partition calculation on the stored data to determine the target partition of the stored data.
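The automatic partition calculation of this claim can be illustrated with a minimal Python sketch in which a pre-configured partition rule is applied to each record. The `METADATA` layout, the table name `access_log`, the field names and the Hive-style `field=value` path format are all hypothetical and serve only to make the idea concrete.

```python
from datetime import datetime, timezone
from typing import Dict, List

# Hypothetical metadata entry: an ordered partition rule per table.
METADATA = {
    "access_log": {
        "partition_rule": [
            {"field": "event_time", "kind": "time", "format": "%Y%m%d"},
            {"field": "region", "kind": "value"},
        ]
    }
}


def compute_partition(table: str, record: Dict[str, object]) -> str:
    """Derive the target partition of a record from the table's partition rule."""
    parts: List[str] = []
    for rule in METADATA[table]["partition_rule"]:
        value = record[rule["field"]]
        if rule["kind"] == "time":
            # Epoch seconds are bucketed with the configured pattern (UTC assumed).
            value = datetime.fromtimestamp(float(value), tz=timezone.utc) \
                            .strftime(rule["format"])
        parts.append(f"{rule['field']}={value}")
    return "/".join(parts)
```

Because the rule lives in the metadata rather than in the writer, changing how a table is partitioned would not require changes to the code that writes the data.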
8. The integrated big data management method based on life cycle management as claimed in claim 7, wherein the merging, by the data storage management subsystem according to the preset small file threshold, of the stored data that falls below the threshold comprises:
setting a small file threshold, against which the small file merging submodule evaluates the stored data;
performing a merging task on the stored data that is below the threshold;
merging the stored data through a Spark job to finally generate a large file;
and carrying out delayed deletion on the original storage data.
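The steps of this claim can be sketched on a local file system. Note that the claim performs the merge with a Spark job; the following Python code substitutes plain file concatenation purely to show the control flow of threshold check, merge and delayed deletion. The names `merge_small_files` and `delete_after_delay`, the output name `merged.dat` and the grace period are assumptions.

```python
import os
from typing import List, Optional, Tuple


def merge_small_files(directory: str, threshold_bytes: int,
                      merged_name: str = "merged.dat"
                      ) -> Tuple[Optional[str], List[str]]:
    """Concatenate files smaller than threshold_bytes into one larger file.

    Returns (merged_path, originals). The originals are NOT deleted here,
    so readers that still have them open are not disrupted.
    """
    small: List[str] = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if (name != merged_name and os.path.isfile(path)
                and os.path.getsize(path) < threshold_bytes):
            small.append(path)
    if len(small) < 2:
        return None, []  # nothing worth merging
    merged_path = os.path.join(directory, merged_name)
    with open(merged_path, "wb") as out:
        for path in small:
            with open(path, "rb") as src:
                out.write(src.read())
    return merged_path, small


def delete_after_delay(paths: List[str], merged_at: float,
                       delay_s: float, now: float) -> int:
    """Delete the original files only once the grace period has elapsed."""
    if now - merged_at < delay_s:
        return 0
    for path in paths:
        os.remove(path)
    return len(paths)
```

Delaying the deletion gives queries that started before the merge time to finish reading the original files.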
9. The integrated big data management method based on life cycle management as claimed in claim 1, wherein the deleting, by the data storage management subsystem, of expired stored data according to the preset storage period of the stored data comprises:
defining a data storage life cycle in the metadata;
judging whether the stored data has exceeded its life cycle;
and automatically deleting, by the life cycle management submodule, the stored data that has exceeded the life cycle.
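The life cycle check of this claim can be illustrated with a short Python sketch that selects the partitions older than the retention period recorded in the metadata. The `LIFECYCLE_DAYS` mapping, the table names and the day-level granularity are assumptions; the actual deletion from Hive or Elasticsearch is left out.

```python
from datetime import date, timedelta
from typing import List

# Hypothetical metadata: storage life cycle per table, in days.
LIFECYCLE_DAYS = {"access_log": 30, "orders": 365}


def expired_partitions(table: str, partitions: List[date],
                       today: date) -> List[date]:
    """Return the day partitions that have outlived the table's life cycle."""
    cutoff = today - timedelta(days=LIFECYCLE_DAYS[table])
    # A partition exactly on the cutoff day is still within its life cycle.
    return [p for p in partitions if p < cutoff]
```

A scheduler could run such a check periodically and hand the returned partitions to the store-specific deletion routine.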
CN202010030011.XA 2020-01-10 2020-01-10 Integrated big data management platform based on life cycle management Pending CN113127462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010030011.XA CN113127462A (en) 2020-01-10 2020-01-10 Integrated big data management platform based on life cycle management

Publications (1)

Publication Number Publication Date
CN113127462A true CN113127462A (en) 2021-07-16

Family

ID=76771053

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010030011.XA Pending CN113127462A (en) 2020-01-10 2020-01-10 Integrated big data management platform based on life cycle management

Country Status (1)

Country Link
CN (1) CN113127462A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7003560B1 (en) * 1999-11-03 2006-02-21 Accenture Llp Data warehouse computing system
US20120317164A1 (en) * 2009-12-30 2012-12-13 Zte Corporation Services Cloud System and Service Realization Method
CN105243443A (en) * 2015-11-16 2016-01-13 国网天津市电力公司 Performance optimization method for large enterprise unstructured platform
CN106203828A (en) * 2016-07-11 2016-12-07 浪潮软件集团有限公司 Data management platform based on data full life cycle management
CN109272155A (en) * 2018-09-11 2019-01-25 郑州向心力通信技术股份有限公司 A kind of corporate behavior analysis system based on big data

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114896098A (en) * 2022-04-29 2022-08-12 重庆大学 A data fault tolerance method and distributed storage system
CN114896098B (en) * 2022-04-29 2023-05-05 重庆大学 Data fault tolerance method and distributed storage system
WO2024167207A1 (en) * 2023-02-08 2024-08-15 에스케이 주식회사 Rdb-based rtdb management method and system
CN118277134A (en) * 2024-06-04 2024-07-02 北京友友天宇系统技术有限公司 Data processing method and system based on distributed message queue

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210716