CN119740653A

CN119740653A - Metadata management method and system based on large model

Info

Publication number: CN119740653A
Application number: CN202411800707.0A
Authority: CN
Inventors: 孙露; 李扬; 李慧娟; 吴士伟; 李晓; 孙浩; 陈通
Original assignee: Shandong Ecloud Information Technology Co ltd
Current assignee: Shandong Ecloud Information Technology Co ltd
Priority date: 2024-12-09
Filing date: 2024-12-09
Publication date: 2025-04-01

Abstract

The present invention belongs to the field of metadata management technology and proposes a metadata management method and system based on a large model. By integrating large model technology into the entire metadata management process, it realizes intelligent filling of metadata information, intelligent analysis of data lineage, intelligent metadata retrieval, and intelligent data finding, which greatly improves the efficiency and accuracy of metadata management. By visually displaying data lineage, intelligently recommending related data objects, and so on, it provides users with a more convenient and intelligent data consumption experience, reduces manual intervention and the error rate, and lowers data management costs. Through intelligent metadata management, data becomes easier to understand and use, which helps tap the potential value of data and provides strong support for enterprise decision-making.

Description

Metadata management method and system based on large model

Technical Field

The invention relates to the technical field of metadata management, in particular to a metadata management method and system based on a large model.

Background

The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.

With the rapid development of big data technology, enterprise data volume has grown explosively and data types have become increasingly diverse. As "data about data", metadata plays a critical role in data management, data governance, and data utilization. However, conventional metadata management methods have a number of disadvantages.

On the one hand, because source systems are often designed and managed without standardization, enterprises can acquire only the most basic technical metadata when collecting metadata, such as database names, schema names, table names, and field names. Such information, while important, falls far short of what enterprises need to understand and use their data in depth. To complete the other core metadata, such as a table's Chinese name, business definition, description, and labels, or a field's Chinese name, description, value description, and sensitivity level, enterprises must invest a great deal of manpower and time.

On the other hand, as business and technology develop, data processing logic grows increasingly complex, and the variety of database types and processing scripts poses great challenges for data lineage analysis. Traditional lineage analysis methods struggle with these complex scenarios, so the success rate and accuracy of lineage analysis are low.

Meanwhile, traditional metadata management and retrieval rely on a structured organization form and keyword retrieval. This approach is effective when data volumes are small and business complexity is low, and it helps users quickly locate the data resources they need; for example, conventional keyword retrieval played an important role in early data map construction. However, as big data technology develops and business demands grow more complex, data scale is growing explosively and data structures are becoming increasingly diverse. In this situation, keyword retrieval shows obvious limitations: users are buried in large numbers of irrelevant or redundant documents and schemas and must spend a great deal of time and effort screening and reading them.

In addition, when users have questions about the specific meaning, use, or structure of data, they often need to consult complex documents or ask data producers or developers in person, which is inefficient and prone to misunderstanding caused by poor communication.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a metadata management method and system based on a large model, which automatically fills in other core metadata based on the most basic technical metadata and business sample data, reduces labor and time costs, and improves the success rate and accuracy of data lineage analysis.

In order to achieve the above purpose, the present invention adopts the following technical scheme:

In a first aspect, the present invention provides a metadata management method based on a large model.

A metadata management method based on a large model, comprising the following processes:

Designing a metamodel and a metadata knowledge graph schema to generate a metadata knowledge graph database;

Selecting a base large model, and performing fine-tuning training and verification of the base large model according to an instruction dataset and a prompt template to obtain a metadata intelligent filling large model, wherein the metadata intelligent filling large model is used to fill in metadata when metadata changes;

Analyzing data processing scripts in data development by using a data lineage intelligent analysis large model, and resolving the lineage relationships between tables, between tables and tasks, and between tables and APIs;

Writing the lineage relationships resolved by the data lineage intelligent analysis large model into the metadata knowledge graph database, updating the metadata filled in by the metadata intelligent filling large model into the metadata knowledge graph database, and writing the metadata of the data standards, data indicators, and data labels in the data management platform, together with the association relationships between these data standards, data indicators, and data labels and the tables and fields, into the metadata knowledge graph database to obtain a metadata knowledge graph;

Vectorizing the metadata knowledge graph through a semantic vector model to obtain a metadata vector library, and performing intelligent metadata retrieval and intelligent data finding according to the base large model, the metadata knowledge graph, and the metadata vector library.

In a second aspect, the present invention provides a large model-based metadata management system.

A metadata management system based on a large model comprises a base large model, a metadata intelligent filling large model, and a data lineage intelligent analysis large model;

The metadata intelligent filling large model obtains metadata from multi-source heterogeneous data through a metadata acquisition engine, and the data lineage intelligent analysis large model obtains data processing scripts from a data development platform, wherein the metadata intelligent filling large model is generated by fine-tuning training and verification of the base large model according to an instruction dataset and a prompt template;

The base large model, the metadata intelligent filling large model, and the data lineage intelligent analysis large model are each integrated into the existing metadata management flow, and when metadata changes, the metadata intelligent filling large model is called to fill in the metadata.

In a third aspect, the present invention provides a computer device comprising a processor and a computer readable storage medium;

a processor adapted to execute a computer program;

a computer readable storage medium having stored therein a computer program which, when executed by the processor, implements a large model based metadata management method according to the first aspect of the present invention.

In a fourth aspect, the present invention provides a computer readable storage medium storing a computer program adapted to be loaded by a processor and to perform a large model based metadata management method according to the first aspect of the present invention.

In a fifth aspect, the present invention provides a computer program product comprising a computer program which, when executed by a processor, implements a large model based metadata management method according to the first aspect of the present invention.

Compared with the prior art, the invention has the beneficial effects that:

1. The metadata management method integrates large model technology into the whole metadata management flow, realizing functions such as intelligent filling of metadata information, intelligent analysis of data lineage, intelligent metadata query, and intelligent data finding, which greatly improves the efficiency and accuracy of metadata management.

2. By visually displaying data lineage, intelligently recommending related data objects, and the like, the invention provides users with a more convenient and intelligent data consumption experience, reduces manual intervention and the error rate, and lowers data management costs.

3. Through intelligent metadata management, the invention makes data easier to understand and use, helps mine the potential value of data, and provides strong support for enterprise decision-making.

Additional aspects of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.

Fig. 1 is a schematic diagram of a metadata management method based on a large model according to embodiment 1 of the present invention;

Fig. 2 is a schematic diagram of fine tuning training and verification of the metadata intelligent filling large model provided in embodiment 1 of the present invention.

Detailed Description

The invention will be further described with reference to the drawings and examples.

It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

Embodiments of the invention and features of the embodiments may be combined with each other without conflict.

Example 1:

This embodiment provides a metadata management method based on a large model to realize intelligent management and efficient utilization of metadata. The technical problems to be solved include: (1) how to automatically fill in other core metadata based on the most basic technical metadata and business sample data, reducing labor and time costs; (2) how to improve the success rate and accuracy of data lineage analysis, especially in scenarios with complex code, heterogeneous databases, and multiple types of processing scripts; and (3) how to use a large language model to build and optimize an intelligent data analysis capability that fundamentally changes the existing ways of querying and understanding data.

The metadata management method of this embodiment, as shown in Fig. 1, mainly comprises metadata management, metadata acquisition, metadata governance, intelligent metadata retrieval, and intelligent data finding. Its core is large-model-based intelligent filling of metadata business information, intelligent analysis of lineage relationships, construction of a metadata knowledge graph, and intelligent semantic retrieval of metadata. Specifically, it includes the following steps:

Step 1, designing the metamodel.

Metamodel design and metadata knowledge graph schema design are carried out according to the requirements of metadata management, and a graph database for the metadata knowledge graph is generated. Nodes include tables, fields, tasks, APIs, data elements, code sets, data indicators, data labels, and the like; edges include upstream/downstream relationships, positive and negative correlation relationships, association relationships, and the like. Each node and each edge carries corresponding attribute information, such as a node's unique identifier, name, and description, and an edge's unique identifier, relationship description, and correlation coefficient.
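The node and edge types listed above can be sketched as a small in-memory schema. This is only an illustration under stated assumptions: the class names, the string spellings of the types, and the `MetadataGraph` container are hypothetical, and a real deployment would use a graph database rather than Python dictionaries.

```python
from dataclasses import dataclass, field
from typing import Optional

# Node and edge types named in the text; the string spellings are assumptions.
NODE_TYPES = {"table", "field", "task", "api", "data_element",
              "code_set", "data_indicator", "data_label"}
EDGE_TYPES = {"upstream_downstream", "correlation", "association"}


@dataclass
class Node:
    node_id: str            # unique identifier
    node_type: str          # one of NODE_TYPES
    name: str
    description: str = ""

    def __post_init__(self):
        if self.node_type not in NODE_TYPES:
            raise ValueError(f"unknown node type: {self.node_type}")


@dataclass
class Edge:
    edge_id: str            # unique identifier
    edge_type: str          # one of EDGE_TYPES
    source_id: str
    target_id: str
    description: str = ""   # relationship description
    correlation: Optional[float] = None  # only meaningful for correlation edges

    def __post_init__(self):
        if self.edge_type not in EDGE_TYPES:
            raise ValueError(f"unknown edge type: {self.edge_type}")


@dataclass
class MetadataGraph:
    """Hypothetical in-memory stand-in for the knowledge graph database."""
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)

    def add_node(self, node: Node) -> None:
        self.nodes[node.node_id] = node

    def add_edge(self, edge: Edge) -> None:
        # Refuse edges whose endpoints have not been registered yet.
        if edge.source_id not in self.nodes or edge.target_id not in self.nodes:
            raise KeyError("both endpoints must exist before adding an edge")
        self.edges[edge.edge_id] = edge
```

Validating types at construction time keeps later writes (lineage edges, filled metadata) from silently introducing node or edge kinds that the schema does not define.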

Step 2, selecting a base large model.

Expert scoring is performed on the outputs of candidate large models along five dimensions: Chinese understanding capability, domain knowledge understanding capability, logical reasoning capability, information generation capability, and instruction-following capability. The best-scoring large model is selected as the base large model for the subsequent processes and is then deployed locally for subsequent use.
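The selection by expert scoring can be illustrated with a short sketch. The text does not say how the five dimension scores are aggregated, so equal weighting and simple averaging across experts are assumptions here, as are the dimension keys and function names.

```python
from statistics import mean

# The five dimensions named in the text; the key spellings are assumptions.
DIMENSIONS = (
    "chinese_understanding",
    "domain_knowledge",
    "logical_reasoning",
    "information_generation",
    "instruction_following",
)


def overall_score(expert_scores):
    """Average each dimension across experts, then average the dimensions.

    expert_scores: list of dicts, one per expert, mapping dimension -> score.
    Equal weighting of dimensions is an assumption, not stated in the text.
    """
    per_dimension = [mean(s[d] for s in expert_scores) for d in DIMENSIONS]
    return mean(per_dimension)


def select_base_model(candidates):
    """candidates: dict mapping model name -> list of expert score dicts."""
    return max(candidates, key=lambda name: overall_score(candidates[name]))
```

If some dimensions matter more for a given enterprise (for example domain knowledge), the inner `mean` could be replaced by a weighted sum without changing the rest of the sketch.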

Step 3, fine-tuning training and verification of the metadata intelligent filling large model.

Because enterprise data involves a great deal of company-specific domain knowledge and has specific requirements on output format and the like, the selected base large model needs some fine-tuning to obtain the metadata intelligent filling large model, which is then deployed and integrated into the metadata management flow, as shown in Fig. 2.

Step 3.1, building the training instruction set and designing the prompt template.

The metadata of existing standard datasets is sorted out, including datasets from national standards, industry standards, and local standards, as well as standard metadata datasets in data warehouses. A prompt template is then designed; the template needs to assign a definite role, for example:

You are a data management expert and understand the xx business very well. Please answer according to the known information and the question. The answer must return data according to the JSON format specification, and nothing beyond the question should be explained.

The format is:

Finally, an instruction dataset with a standardized question-answer structure, containing questions and answers, is obtained from the prompt template and the standard datasets. An example question from the instruction dataset is:

The "PERFTRACE_INFO" table is known, and its English fields include: TASKID, PROCESSID, SQL, INITVALUE, FINALVALUE, COLUMNTYPE, EXECTORSTATUS, DELNUM, TOTALCOSTS, STATETIMESTAMP, ENDTIMESTAMP, BYTESPEEDPERSECOND, RECORDSPEEDPERSECOND, TOTALREADRECORDS, TOTALERRORRECORDS;

Please explain the Chinese name and business meaning of the PERFTRACE_INFO table, and give the Chinese name and business meaning of each field. The table's Chinese name, the table's business meaning, each field's Chinese name, and each field's business meaning are each limited to 200 words. Please answer according to the specification, return all of the analysis in JSON format, and do not omit any field;

The answers in the instruction dataset organize the data according to the required JSON format specification.
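Assembling one question-answer record from the prompt template and a standard dataset entry might look like the sketch below. The record keys (`instruction`, `input`, `output`) and the function names are assumptions, and because the required JSON answer format is not reproduced in the text, the reference answer is passed in by the caller rather than given a structure here.

```python
import json

# Role prompt paraphrased from the template above; "xx" is the text's own placeholder.
ROLE_PROMPT = (
    "You are a data management expert and understand the xx business very well. "
    "Answer according to the known information and the question, return data in "
    "JSON format, and do not explain anything beyond the question."
)


def build_question(table, fields, max_words=200):
    """Render the question part of a training sample for one table."""
    field_list = ", ".join(fields)
    return (
        f'The "{table}" table is known, and its English fields include: {field_list}. '
        "Explain the Chinese name and business meaning of the table and of each field, "
        f"each within {max_words} words. Return everything in JSON format and do not "
        "omit any field."
    )


def build_record(table, fields, answer):
    """Build one instruction-dataset record.

    answer: reference answer taken from the standard dataset, as a Python object
    already shaped according to the required JSON format (not defined here).
    """
    return {
        "instruction": ROLE_PROMPT,
        "input": build_question(table, fields),
        "output": json.dumps(answer, ensure_ascii=False),
    }
```

`ensure_ascii=False` keeps Chinese names readable in the serialized answer instead of escaping them.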

Step 3.2, model fine-tuning training and verification.

Based on the base large model, the LoRA algorithm is used to fine-tune the large model on the instruction dataset, building a domain large model that can automatically generate metadata information. Some datasets are then randomly selected as a verification dataset to test the domain large model, and business experts are invited to manually review the test results and verify whether the quality of the generated content meets the standard. If it does, the verified domain large model is designated as the metadata intelligent filling large model.
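For readers unfamiliar with LoRA, its core idea is that the pretrained weight matrix W0 stays frozen and only a low-rank update B·A, scaled by alpha/r, is trained. The following dependency-free sketch shows that arithmetic only. The rank, alpha, and which layers are adapted are not specified in the text, so every value and name here is an assumption; practical fine-tuning would use a training framework rather than nested lists.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_init(d_out, d_in, rank, seed=0):
    """Create the trainable LoRA factors.

    A (rank x d_in) starts as small random values and B (d_out x rank) starts
    at zero, so the adapted weight initially equals the frozen base weight.
    """
    rng = random.Random(seed)
    a = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    b = [[0.0] * rank for _ in range(d_out)]
    return a, b


def lora_weight(w0, a, b, alpha):
    """Return W0 + (alpha / rank) * B @ A without modifying W0."""
    rank = len(a)
    delta = matmul(b, a)  # d_out x d_in low-rank update
    scale = alpha / rank
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]
```

Because only A and B are trained, the number of trainable parameters is rank × (d_in + d_out) per adapted matrix instead of d_in × d_out, which is what makes fine-tuning a locally deployed base model affordable.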

Step 3.3, deployment and application of the metadata intelligent filling large model.

The metadata intelligent filling large model is integrated into the existing metadata management flow. A metadata collection engine periodically collects business system metadata (that is, collects metadata from multi-source heterogeneous data), and if the metadata has changed, the metadata intelligent filling large model is called to fill in the metadata.
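The "fill only when metadata has changed" behavior can be sketched as follows. The text does not describe how changes are detected, so comparing a hash of each table's technical metadata between collection runs is an assumption, and `fill_model` is a hypothetical callable standing in for the metadata intelligent filling large model.

```python
import hashlib
import json


def fingerprint(table_meta):
    """Stable hash of one table's technical metadata."""
    payload = json.dumps(table_meta, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def process_collection(collected, known_fingerprints, fill_model):
    """Call the filling model only for new or changed tables.

    collected: dict mapping table name -> technical metadata dict
    known_fingerprints: dict mapping table name -> fingerprint from the
        previous run; updated in place
    fill_model: callable(table_name, table_meta) -> filled metadata
        (hypothetical stand-in for the fine-tuned model)
    Returns a dict mapping table name -> filled metadata for changed tables.
    """
    filled = {}
    for name, meta in collected.items():
        current = fingerprint(meta)
        if known_fingerprints.get(name) != current:
            filled[name] = fill_model(name, meta)
            known_fingerprints[name] = current
    return filled
```

Skipping unchanged tables keeps periodic collection cheap, since model inference is the expensive part of each run.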

Step 4, intelligent analysis of data lineage.

The data lineage intelligent analysis large model is used to analyze the complex data processing scripts in the data development platform and resolve the lineage relationships between tables, between tables and tasks, and between tables and APIs. The data lineage intelligent analysis large model is then integrated into the existing metadata management flow. It can be understood that the data development platform in this embodiment may preferably be the data development platform described in CN116910078A (a metadata-centric data management platform and implementation method).

Step 5, acquiring global metadata information and constructing the metadata knowledge graph.

The lineage relationships between tables, between tables and tasks, and between tables and APIs resolved in step 4 are written into the metadata knowledge graph database created in step 1; the metadata filled in during step 3 is updated into the same database; and the metadata of the data standards, data indicators, and data labels in the data management platform, together with the association relationships between them and the tables and fields, is also written into that database. It can be understood that the data management platform may preferably be the one described in CN116910078A (a metadata-centric data management platform and implementation method).
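Once lineage relationships are written into the graph, downstream impact can be read off by traversal. The sketch below uses an in-memory adjacency map as a stand-in for the graph database; the class name, the `type:name` identifier convention, and the (source, target) pair format for resolved relations are all assumptions.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Hypothetical in-memory store for upstream/downstream lineage edges."""

    def __init__(self):
        self.downstream = defaultdict(set)

    def add_edge(self, source, target):
        self.downstream[source].add(target)

    def load(self, relations):
        """relations: iterable of (source, target) pairs resolved in step 4."""
        for source, target in relations:
            self.add_edge(source, target)

    def impacted(self, start):
        """Return every node reachable downstream of start (breadth-first)."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in self.downstream.get(node, ()):
                if nxt != start and nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen
```

For example, loading table-to-task, task-to-table, and table-to-API edges lets `impacted("table:ods_order")` list everything affected by a change to that source table, which is the kind of question a lineage view is meant to answer.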

Step 6, constructing the metadata vector library.

The metadata knowledge graph from step 5 is vectorized through a semantic vector model (for example, the BAAI BGE Chinese-English semantic vector model) to produce the metadata vector library used in the subsequent steps.
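The shape of a metadata vector library can be shown with a toy example. The `embed` function below is a hashed bag-of-words stand-in, not a semantic vector model; it only makes the sketch runnable without external dependencies. The node dictionary keys, the text serialization, and the function names are likewise assumptions.

```python
import hashlib
import math

DIM = 64  # toy dimensionality; real embedding models use far more


def embed(text):
    """Toy hashed bag-of-words embedding (stand-in for a real semantic model)."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]  # unit length, so dot product = cosine


def node_to_text(node):
    """Serialize one knowledge graph node into text for embedding."""
    return f"{node['type']} {node['name']} {node['description']}"


def build_vector_library(nodes):
    """Map node id -> embedding vector."""
    return {n["id"]: embed(node_to_text(n)) for n in nodes}


def search(library, query, top_k=3):
    """Return the ids of the top_k nodes by cosine similarity to the query."""
    q = embed(query)
    scored = [
        (sum(a * b for a, b in zip(q, vec)), node_id)
        for node_id, vec in library.items()
    ]
    scored.sort(reverse=True)
    return [node_id for _, node_id in scored[:top_k]]
```

Swapping `embed` for a call to an actual embedding model leaves `build_vector_library` and `search` unchanged, which is why the serialization step is kept separate.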

Step 7, intelligent metadata retrieval.

Combined with the data map function, intelligent metadata retrieval helps users accurately find the data they need, and specifically includes:

Step 7.1, problem understanding.

The base large model is used to perform semantic analysis on the user query and understand the real intent behind it, so that the system captures the intent accurately even when users express the same intent in different ways.

Step 7.2, generating a semantic vector.

The intent understood by the base large model in step 7.1 is vectorized to obtain a semantic vector, semantic retrieval is performed with that vector, and the user's intent is judged from the semantic retrieval results, thereby providing search results that better fit the requirement.

Step 7.3, ranking and recommending answers.

The keyword retrieval results are combined with the semantic retrieval results from step 7.2, the combined results are sorted by relevance, and the answers that best meet the user's needs are placed first.
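The text does not specify how keyword and semantic results are merged. One common option, shown here purely as an assumption, is reciprocal rank fusion (RRF), which needs only the two ranked lists and no score calibration between the retrievers; the constant `k=60` is the conventional default, not a value from the source.

```python
def reciprocal_rank_fusion(keyword_hits, semantic_hits, k=60):
    """Merge two ranked lists of ids into one list, best first.

    Each id receives 1 / (k + rank) from every list it appears in, so items
    ranked highly by both retrievers rise to the top. Ties are broken by id
    to keep the output deterministic.
    """
    scores = {}
    for hits in (keyword_hits, semantic_hits):
        for rank, doc_id in enumerate(hits, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

With keyword results `["a", "b", "c"]` and semantic results `["b", "d", "a"]`, the fused order is `["b", "a", "d", "c"]`: `b` and `a` appear in both lists, and `b` wins because its ranks (2 and 1) are better than those of `a` (1 and 3).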

Step 7.4, model deployment and application.

The metadata intelligent retrieval model is integrated into the existing metadata management flow, so that metadata retrieval in the data map function can locate results accurately and quickly find the data for a specific scenario.

Step 8, intelligent data finding.

The intelligent data finding function helps users apply data better, in particular by providing detailed business analysis and definition interpretation for the selected data. It targets queries that go beyond simple keyword search: through its intelligent data analysis capability, the system recognizes and interprets complex data structures and business logic, and supports in-depth analysis and interpretation of tables, fields, data standards, data indicators, and data labels to meet users' needs for understanding data. It specifically includes the following steps:

Step 8.1, problem understanding.

The base large model is used to perform semantic analysis on the user's question and understand the real requirement behind it, so that the system captures the requirement accurately even when users express the same intent in different ways.

Step 8.2, generating a semantic vector.

The intent understood by the base large model in step 8.1 is vectorized to generate a semantic vector, semantic retrieval is performed with that vector, and the user's intent is judged from the semantic retrieval results, thereby providing search results that better fit the requirement.

Step 8.3, ranking and recommending answers.

The system re-ranks the semantic retrieval results from step 8.2 and places the answers that best meet the user's needs first.

Step 8.4, summarizing and refining.

The base large model is used to summarize the retrieved answers, extracting the key information concisely and clearly so that users find the required content more quickly.

Step 8.5, model deployment and application.

The base large model (with the intelligent data finding function) is integrated into the existing metadata management flow to understand complex data structures and business logic, and the intelligent data finding function meets users' needs for understanding data through conversational answers.
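Steps 8.1 to 8.4 form a simple pipeline, sketched below with each stage injected as a callable. The function name, the parameter names, and the shape of the returned dictionary are assumptions; in practice `understand` and `summarize` would call the base large model and `retrieve` would query the metadata vector library.

```python
def find_data(question, understand, retrieve, rerank, summarize, top_k=3):
    """Run the intelligent data finding pipeline for one user question.

    understand: callable(question) -> intent text               (step 8.1)
    retrieve:   callable(intent) -> list of candidate ids       (step 8.2)
    rerank:     callable(intent, candidates) -> ordered ids     (step 8.3)
    summarize:  callable(question, ranked_ids) -> answer text   (step 8.4)
    All four callables are hypothetical stand-ins supplied by the caller.
    """
    intent = understand(question)
    candidates = retrieve(intent)
    ranked = rerank(intent, candidates)[:top_k]
    return {
        "intent": intent,
        "sources": ranked,
        "answer": summarize(question, ranked),
    }
```

Returning the ranked sources alongside the summarized answer lets the conversational interface show which tables or fields the answer was drawn from.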

Step 9, continuous model optimization and iterative upgrading.

The latest research developments are tracked, new technologies are tried, and the base large model, the metadata intelligent filling large model, and the data lineage intelligent analysis large model are periodically retrained or fine-tuned.

Example 2:

This embodiment provides a metadata management system based on a large model, comprising a base large model, a metadata intelligent filling large model, and a data lineage intelligent analysis large model. The large models may be deployed on a single terminal device or in a distributed manner, and may be embedded into the metadata management flow. Metadata management is realized by the specific method of Example 1, which is not repeated here.

Example 3:

The present implementation provides an electronic device that includes a processor, a communication interface, and a computer-readable storage medium. Wherein the processor, the communication interface, and the computer-readable storage medium may be connected by a bus or other means.

Wherein the communication interface is for receiving and transmitting data, the computer readable storage medium may be stored in a memory of the electronic device, the computer readable storage medium is for storing a computer program comprising program instructions, and the processor is for executing the program instructions stored by the computer readable storage medium.

A processor (or CPU, Central Processing Unit) is the computing core and control core of an electronic device. It is adapted to implement one or more instructions, and in particular to load and execute one or more instructions to implement a corresponding method flow or function.

The processor is configured to perform the following:

The detailed procedure is the same as the specific method in Example 1 and is not repeated here.

Example 4:

The present implementation provides a computer-readable storage medium (memory), which is a memory device in an electronic device for storing programs and data. It is understood that the computer-readable storage medium here may include both built-in storage media in the electronic device and extended storage media supported by the electronic device. The computer-readable storage medium provides a memory space that stores a processing system of the electronic device.

Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. It should be noted that the computer readable storage medium may be a high speed RAM memory, or may be a non-volatile memory, such as at least one magnetic disk memory, or may alternatively be at least one computer readable storage medium located remotely from the foregoing processor.

In one embodiment, the computer readable storage medium has one or more instructions stored therein, and the processor loads and executes the one or more instructions stored in the computer readable storage medium to implement the following:

Example 5:

The present implementation provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the electronic device to perform the following:

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in or transmitted across a computer-readable storage medium. The computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). A computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)), or the like.

The above description covers only the preferred embodiments of the present invention and is not intended to limit the present invention; those skilled in the art can make various modifications and variations to it. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for metadata management based on a large model, comprising the following steps:

2. The large model based metadata management method as claimed in claim 1,

Designing a metamodel and a metadata knowledge graph schema, wherein the nodes comprise tables, fields, tasks, APIs, data elements, code sets, data indicators, and data labels, the edges comprise upstream/downstream relationships, positive and negative correlation relationships, and association relationships, and each node and each edge comprises corresponding attribute information.

3. The large model based metadata management method as claimed in claim 1,

Based on the base large model, performing fine-tuning training of the base large model on the instruction dataset by using the LoRA algorithm, constructing a domain large model for automatically generating metadata information, randomly selecting datasets as a verification dataset to test the constructed domain large model, and, in combination with manual review results, obtaining a metadata intelligent filling large model whose generated content quality meets the standard;

Integrating the metadata intelligent filling large model into a metadata management flow, periodically collecting business system metadata by a metadata collection engine, and, if the metadata has changed, calling the metadata intelligent filling large model to fill in the metadata.

4. The large model based metadata management method as claimed in claim 1,

Performing intelligent metadata retrieval, including:

semantic analysis is carried out on the user query by using the base large model, and the real intention behind the user query is understood;

vectorizing the real intention to generate a semantic vector, and carrying out semantic retrieval according to the semantic vector;

And combining the keyword retrieval and semantic retrieval results, sorting the retrieval results according to the relevance, and recommending the answer most conforming to the real intention to the front.

5. The large model based metadata management method as claimed in claim 1,

Performing intelligent data finding, including:

semantic analysis is carried out on questions of a user by using a base large model, and the real requirements behind the questions are understood;

vectorizing real demands understood through a base large model, generating semantic vectors, and carrying out semantic retrieval according to the semantic vectors;

re-ordering according to the semantic retrieval result, and recommending the answer which meets the user requirement to the front;

The answers most meeting the user's needs are summarized using a base big model, and key information is extracted to help the user find the desired content.

6. The large model based metadata management method as claimed in claim 1,

Performing expert scoring on the output results of different large models according to five dimensions of Chinese understanding capability, domain knowledge understanding capability, logical reasoning capability, information generation capability, and instruction-following capability, and selecting the optimal large model as the base large model.

7. A metadata management system based on a large model, characterized by comprising a base large model, a metadata intelligent filling large model, and a data lineage intelligent analysis large model;

8. A computer device comprises a processor and a computer-readable storage medium;

a processor adapted to execute a computer program;

A computer readable storage medium having stored therein a computer program which, when executed by the processor, implements the large model based metadata management method according to any of claims 1 to 6.

9. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program adapted to be loaded by a processor and to perform the large model based metadata management method according to any of claims 1 to 6.

10. A computer program product, characterized in that the computer program product comprises a computer program which, when executed by a processor, implements the large model based metadata management method according to any of claims 1 to 6.