Disclosure of Invention
The application provides a method, a device and equipment for generating query sql based on hive metadata, and aims to solve the problems that due to the fact that writing of complex sql statements is carried out through a big data assembly, relationships between metadata and data tables need to be maintained, and an operator needs to write complex sql according to the relationships between the tables, and development efficiency is low.
An embodiment of a first aspect of the present application provides a method for generating a query sql based on hive metadata, including the following steps: acquiring metadata information, labeling the metadata information, and storing the metadata information to a preset mysql database to obtain initial metadata; determining field lists of different tables based on the initial metadata, connecting the field lists of the different tables to generate a data table association relation, storing the data table association relation serving as a data model to a preset subject domain, and storing the preset subject domain to a preset mysql database; and acquiring a target data relation in the data model, and generating an sql template for inquiring the target data relation by using a recursive algorithm so as to carry out number extraction development by modifying the sql template.
According to the technical means, the problem that due to the fact that the complex sql statements are written for counting by means of a big data component, the relation between metadata and data tables needs to be maintained, and an operator needs to write the complex sql statements according to the relation between the tables, the development efficiency is low is solved, the maintenance cost of an administrator is reduced, and the development efficiency is improved.
Further, the acquiring metadata information, labeling the metadata information, and storing the labeled metadata information in a preset mysql database includes: acquiring information of an existing database and a data table through a live metastore api (a configuration unit element storage interface); based on the existing database, acquiring the metadata information in the data table information by using the hive scanner api and hdfs (Hadoop Distributed File System); and labeling the metadata information based on preset labeling data, and storing the labeled metadata information to the preset mysql database.
According to the technical means, the information of the existing database and the data table is obtained through the live metastore api, an operable interface is provided, and metadata information can be conveniently marked.
Further, the metadata information includes at least one of a table name, a table remark, table field information, a hard disk space size of table original data and a distributed file system HDFS space occupied by the table data.
According to the technical means, the information of the metadata in the database and the data table is obtained.
Further, the preset labeling data includes: at least one of the names of the characters in the table, the updating period and the preset loading strategy.
According to the technical means, the introduced metadata is labeled by the operable page, so that the metadata is more readable.
Further, the data model is a tree structure composed of a fact table and a plurality of dimension tables, wherein the fact table is a root node.
According to the technical means, the tree structure refers to a data structure with one-to-many tree relationship among data elements, the data structure is clear, and the database query rate is high.
Further, the data table association includes a first association between the fact table and any one of the plurality of dimension tables and a second association between any two of the plurality of dimension tables.
According to the technical means, the consistency and the integrity of the data in the two tables are restricted by establishing connection of the data of the two tables.
An embodiment of a second aspect of the present application provides an apparatus for generating a query sql based on hive metadata, including: the marking module is used for acquiring metadata information, marking the metadata information and storing the metadata information to a preset mysql database to obtain initial metadata; the storage module is used for determining field lists of different tables based on the initial metadata, connecting the field lists of the different tables, generating a data table association relation, storing the data table association relation serving as a data model to a preset subject domain, and storing the preset subject domain to the preset mysql database; and the generation module is used for acquiring the target data relation in the data model, generating an sql template for inquiring the target data relation by using a recursive algorithm and carrying out number-increasing development by modifying the sql template.
Further, the labeling module is specifically configured to: acquiring information of an existing database and a data table through a live metastore api; based on the existing database, acquiring the metadata information in the data table information by using the hive metastore api and the hdfs api; and marking the metadata information based on preset marking data, and storing the marked metadata information to the preset mysql database.
Further, the device for generating query sql based on hive metadata is characterized in that the metadata information includes at least one of a table name, a table remark, table field information, a hard disk space size of table original data, and a distributed file system HDFS space occupied by table data.
Further, the apparatus for generating query sql based on hive metadata is characterized in that the preset annotation data includes: at least one of a table name, an update period, and a preset loading policy.
Further, the apparatus for generating query sql based on hive metadata is characterized in that the data model is a tree structure composed of a fact table and a plurality of dimension tables, where the fact table is a root node.
Further, the apparatus for generating query sql based on hive metadata is characterized in that the data table association relationship includes a first association relationship between the fact table and any one of the plurality of dimension tables and a second association relationship between any two of the plurality of dimension tables.
An embodiment of a third aspect of the present application provides an electronic device, including: a memory, a processor and a computer program stored on the memory and executable on the processor, the processor executing the program to implement the method of generating a query sql based on hive metadata as described in the above embodiments.
A fourth aspect of the present application provides a computer-readable storage medium, on which a computer program is stored, where the computer program is executed by a processor, so as to implement the method for generating a query sql based on hive metadata as described in the foregoing embodiments.
Therefore, initial metadata is obtained by obtaining metadata information, labeling the metadata information and storing the metadata information to a preset mysql database, field lists of different tables are determined based on the initial metadata, the field lists of the different tables are connected, a data table association relation is generated, the data table association relation is stored to a preset subject domain as a data model, the preset subject domain is stored to the preset mysql database, a target data relation in the data model is obtained, an sql template for inquiring the target data relation is generated by using a recursive algorithm, and number extraction development is performed by modifying the sql template. The problem that due to the fact that the complex sql statements are written for number increasing through a big data assembly, the relation between metadata and data tables needs to be maintained, and an operator needs to write the complex sql statements according to the relation between the tables, development efficiency is low is solved, maintenance cost of an administrator is reduced, and development efficiency is improved.
Additional aspects and advantages of the present application will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the present application.
Detailed Description
Reference will now be made in detail to the embodiments of the present application, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary and intended to be used for explaining the present application and should not be construed as limiting the present application.
The following describes a method, an apparatus, and a device for generating a query sql based on hive metadata according to an embodiment of the present application with reference to the drawings. In the method, metadata information is obtained, the metadata information is labeled and then is stored in a preset mysql database to obtain initial metadata, field lists of different tables are determined based on the initial metadata and are connected with the field lists of different tables to generate a data table association relation, the data table association relation is stored to a preset mysql database as a data model, a preset theme domain is stored to the preset mysql database to obtain a target data relation in the data model, a sql template for inquiring the target data relation is generated by a recursive algorithm, and the number extraction development is carried out by modifying the sql template. Therefore, the problem that due to the fact that the writing of complex sql statements is carried out through a large data assembly, the relation between metadata and data tables needs to be maintained, and an operator needs to write the complex sql according to the relation between the tables, the development efficiency is low is solved, the maintenance cost of an administrator is reduced, and the development efficiency is improved.
Specifically, fig. 1 is a flowchart illustrating a method for generating a query sql based on hive metadata according to an embodiment of the present application.
As shown in fig. 1, the method for generating query sql based on hive metadata includes the following steps:
in step S101, metadata information is obtained, and the metadata information is tagged and then stored in a preset mysql database, so as to obtain initial metadata.
The hive is a data warehouse tool based on a distributed system infrastructure Hadoop, is used for data extraction, conversion and loading, and is a mechanism capable of storing, inquiring and analyzing large-scale data stored in the Hadoop; metadata, i.e. hive-defined table name, field name, type, partition, user data, is typically stored in mysql.
Optionally, in some embodiments, obtaining metadata information, labeling the metadata information, and storing the labeled metadata information in a preset mysql database includes: acquiring information of an existing database and a data table through a live metastore api; based on the existing database, using hive metastore api and hdfsapi to obtain the metadata information in the data table information; and marking metadata information based on preset marking data, and storing the marked metadata information to a preset mysql database.
Wherein the metadata information obtained from hive and hdfs and the operator tagged information are stored to mysql.
Optionally, in some embodiments, the metadata information may include at least one of a table name, a table remark, table field information, a size of a hard disk space where table raw data is located, and a distributed file system HDFS space occupied by the table data.
It can be understood that, the information of the existing database and the data table is obtained through the live metastore api, an operable interface is provided, the administrator introduces the metadata of the table by selecting the data table, and the following information is obtained through the live metastore api and the hdfs api: table name, table remark, table field information (field English name, field type, field division, remark), hard disk space size of table original data, and HDFS space size occupied by table data.
Optionally, in some embodiments, the preset annotation data includes: at least one of a table name, an update period, and a preset loading policy.
It should be understood that the operator may then make some annotation on the imported metadata through the operable page, so that the metadata is more readable, and may supplement the following information: table name, update period (day, week, month, year, real time), loading policy (incremental append, full overlay, full snapshot, upsert).
In step S102, based on the initial metadata, field lists of different tables are determined, the field lists of different tables are connected, a data table association relationship is generated, and after the data table association relationship is stored as a data model in a preset subject domain, the preset subject domain is stored in a preset mysql database.
Optionally, in some embodiments, the data model is a tree structure consisting of a fact table and a plurality of dimension tables, wherein the fact table is a root node.
The fact table stores data which can embody actual data or detailed numerical values and generally consists of dimension codes and fact data; the dimension table stores data with independent attributes and hierarchical structures, generally consists of dimension codes and corresponding dimension descriptions (labels), and can be associated with the fact table.
Optionally, in some embodiments, the data table associations include a first association between the fact table and any one of the plurality of dimension tables and a second association between any two of the plurality of dimension tables.
Specifically, a concept of a topic domain is introduced, a data model is located under the topic domain, and a relational graph corresponding to a plurality of tables under one model is shown in fig. 2. A tree structure of a model is defined, which is composed of a fact table and a plurality of dimension tables, and the fact table is used as a root node. Introducing a plurality of tables with mysql which is stored, labeling the table type (a fact table or a dimension table), establishing the association relation between the fields of the fact table and the dimension table and between the dimension table and the dimension table, and storing the association relation in a mysql database. The theme domain is a larger concept, the theme is a little, and the entity is the smallest, wherein the theme is a data model, and the entity represents a relational graph of a plurality of tables.
In step S103, a target data relationship in the data model is obtained, and an sql template for querying the target data relationship is generated by using a recursive algorithm, so as to perform number-increasing development by modifying the sql template.
Specifically, a data model under a certain subject domain is selected, the data relationship of the model is read, generation of query sql between relationship tables is realized by using a recursive function, the function is described below by combining a specific example, and a flow diagram of a recursive method for generating the query sql is shown in fig. 3.
In step S301, local variables (sql, modelId, tableId, tabNum) and a global variable tableNum (table number) are input.
First, a function declaration of Java language is private StringBuilder sqlnproducer (StringBuilder sql, stringmodel id, stringtableid, int tabNum); there is a global variable tablemum with an initial value of 0.
In step S302, the table field and table name are looked up using tableld. Specifically, the global variable tablemum is linked by the character't' for this part of the name, resulting in, for example:
select\n\tname,age\nfrom student t0。
step S303, splicing the 'select table field'. The field information and the table name information of the table are obtained from the database through tableId, such as the field id, the name and the age of the table, a splicing selection part is adopted, tabNum is used as indentation control, the initial value is 0, the formatting of sql is controlled and generated, the initial \ t quantity of each indentation is the same as that of tabNum, and when the indentation is needed, the \ t quantity of tabNum +1 is used. The results are for example:
select\n\tid,name,age\nfrom employee。
step S304, splicing the "from table name".
In step S305, the table is named "t" + tablemm. Specifically, the global variable tablemum is named for this part by the character't', resulting in, for example:
select\n\tname,age\nfrom student t0。
step S306, using modelId (service model id) and tableId to query the right table information. Specifically, right connection table information such as table destination field name and employeeId is queried through modeld and tableId, the connection mode is left connection left join (the other connection modes are the same), the first half of sql is spliced and connected, and the result is, for example:
select\n\tid,name,age\nfrom student t0\nleft join(\n。
in step S307, it is determined whether the query right connection table exists, if so, step S309 is executed, and if not, step S308 is executed.
And step S308, returning to the generated sql.
Step S309, determine the type left, right, inner, full and field of the connection.
And step S310, splicing and connecting the front half part of the sql and the left bracket.
In step S311, the "sql generation flow" (sql, modelId, rightTableId, tabNum + 1) is called again with the currently generated sql and the connection table id as parameters to splice the return values.
And step S312, splicing and connecting the second half part of the sql and the right bracket.
In step S313, the table is named "t" + tablemm.
Step S314, tableNum + +. Wherein, the second half of the concatenation connection sql names the connected sub-query part according to the global variable tableNum, and the result is, for example:
select\n\tid,name,age\nfrom student t0\nleft join
(\n\tselect\n\t\tname,employeeId\n\tfrom department t1)t2 on t0.id=t2.employeeId
and step S315, splicing connection conditions. Specifically, if there are a plurality of connection conditions, the and splicing process is used.
And step S316, returning the generated sql.
According to the method for generating the query sql based on the hive metadata, metadata information is obtained, the metadata information is labeled and then stored in a preset mysql database, initial metadata is obtained, field lists of different tables are determined based on the initial metadata, the field lists of the different tables are connected, a data table association relation is generated, the data table association relation is stored to a preset subject domain as a data model, the preset subject domain is stored to the preset mysql database, a target data relation in the data model is obtained, an sql template for querying the target data relation is generated by using a recursive algorithm, and the number is extracted and developed by modifying the sql template. Therefore, the problem that the relation between metadata and a data table needs to be maintained due to the fact that the writing of complex sql statements is carried out by means of a big data component, and an operator needs to write the complex sql according to the relation between the tables, so that the development efficiency is low is solved, the maintenance cost of an administrator is reduced, and the development efficiency is improved.
Next, an apparatus for generating a query sql based on hive metadata according to an embodiment of the present application will be described with reference to the drawings.
Fig. 4 is a block diagram illustrating an apparatus for generating a query sql based on hive metadata according to an embodiment of the present application.
As shown in fig. 4, the apparatus 10 for generating a query sql based on hive metadata includes: an annotation module 100, a storage module 200, and a generation module 300.
The labeling module 100 is configured to acquire metadata information, label the metadata information, and store the labeled metadata information in a preset mysql database to obtain initial metadata; the storage module 200 is configured to determine field lists of different tables based on the initial metadata, connect the field lists of the different tables, generate a data table association relationship, store the data table association relationship as a data model to a preset topic domain, and store the preset topic domain to a preset mysql database; the generating module 300 is configured to obtain a target data relationship in the data model, and generate an sql template for querying the target data relationship by using a recursive algorithm, so as to perform number-increasing development by modifying the sql template.
Optionally, in some embodiments, the labeling module 100 is configured to: acquiring information of an existing database and a data table through a live metastore api; based on the existing database, using hive metastore api and hdfs api to obtain the metadata information in the data table information; and marking metadata information based on preset marking data, and storing the marked metadata information to a preset mysql database.
Optionally, in some embodiments, the metadata information includes at least one of a table name, a table remark, table field information, a size of a hard disk space where table raw data is located, and a distributed file system HDFS space occupied by the table data.
Optionally, in some embodiments, the preset annotation data includes: at least one of the names of the characters in the table, the updating period and the preset loading strategy.
Optionally, in some embodiments, the data model is a tree structure consisting of a fact table and a plurality of dimension tables, wherein the fact table is a root node.
Optionally, in some embodiments, the data table associations include a first association between a fact table and any one of the plurality of dimension tables and a second association between any two of the plurality of dimension tables.
It should be noted that the foregoing explanation of the embodiment of the apparatus for generating the query sql based on the hive metadata is also applicable to the method for generating the query sql based on the hive metadata in this embodiment, and is not described herein again.
According to the device for generating the query sql based on the hive metadata, metadata information is obtained, the metadata information is labeled and then stored in a preset mysql database, initial metadata is obtained, field lists of different tables are determined based on the initial metadata, the field lists of the different tables are connected, a data table association relation is generated, the data table association relation is stored to a preset subject domain as a data model, the preset subject domain is stored to the preset mysql database, a target data relation in the data model is obtained, an sql template for querying the target data relation is generated by using a recursive algorithm, and the number is extracted and developed by modifying the sql template. Therefore, the problem that the relation between metadata and a data table needs to be maintained due to the fact that the writing of complex sql statements is carried out by means of a big data component, and an operator needs to write the complex sql according to the relation between the tables, so that the development efficiency is low is solved, the maintenance cost of an administrator is reduced, and the development efficiency is improved.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
The vehicle may include: a memory 501, a processor 502, and a computer program stored on the memory 501 and executable on the processor 502.
The processor 502, when executing a program, implements the method of generating a query sql based on hive metadata as provided in the embodiments above.
Further, the electronic device further includes:
a communication interface 503 for communication between the memory 501 and the processor 502.
A memory 501 for storing computer programs operable on the processor 502.
The memory 501 may comprise high-speed RAM memory, and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
If the memory 501, the processor 502 and the communication interface 503 are implemented independently, the communication interface 503, the memory 501 and the processor 502 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 5, but this is not intended to represent only one bus or type of bus.
Optionally, in a specific implementation, if the memory 501, the processor 502, and the communication interface 503 are integrated on one chip, the memory 501, the processor 502, and the communication interface 503 may complete mutual communication through an internal interface.
The processor 502 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), or one or more Integrated circuits configured to implement embodiments of the present Application.
Embodiments of the present application also provide a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the above method for generating a query sql based on hive metadata.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above are not necessarily intended to refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or N embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first", "second" and "first" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "N" means at least two, e.g., two, three, etc., unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more N executable instructions for implementing steps of a custom logic function or process, and alternate implementations are included within the scope of the preferred embodiment of the present application in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of implementing the embodiments of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the N steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable gate arrays, field programmable gate arrays, and the like.
It will be understood by those skilled in the art that all or part of the steps carried out in the method of implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and the program, when executed, includes one or a combination of the steps of the method embodiments.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.