Disclosure of Invention
In view of this, an object of the present application is to provide a data query method, an apparatus, an electronic device and a computer-readable storage medium, which improve data query efficiency.
In a first aspect, an embodiment of the present application provides a data query method, including:
obtaining multi-source heterogeneous data, and performing clustering analysis on the multi-source heterogeneous data to obtain dimensions contained in the multi-source heterogeneous data and a dimension hierarchy corresponding to each dimension;
constructing a dimension table based on the obtained dimensions and the dimension level corresponding to each dimension, and constructing a fact table associated with the dimension table based on the obtained multi-source heterogeneous data;
constructing a data cube based on the dimension table and the fact table;
based on the dimension combination contained in the dimension table, creating a structured query language statement corresponding to the dimension combination, acquiring a cube corresponding to the dimension combination from the constructed data cube, and storing the structured query language statement and the cube into a database;
receiving query information input by a user, converting the query information into structured query information, acquiring cube information corresponding to the structured query information from the database, and outputting the acquired cube information to the user.
With reference to the first aspect, the present invention provides a first possible implementation manner of the first aspect, where constructing a fact table associated with the dimension table based on the obtained multi-source heterogeneous data includes:
constructing a star graph containing a dimension table and a fact table associated with the dimension table according to the multi-source heterogeneous data;
creating a wide table by calling a data warehouse according to the fact table and the dimension table of the star map;
removing the weight of the dimension column in the wide table;
and creating a dimension dictionary according to the de-duplicated wide table, and associating the dimension dictionary with the fact table.
With reference to the first aspect, the present invention provides a second possible implementation manner of the first aspect, where constructing a fact table associated with the dimension table based on the obtained multi-source heterogeneous data includes:
reading a dimension and a data value corresponding to the dimension from multi-source heterogeneous data;
inquiring whether the read dimensionality exists in the stored dimensionality table;
if the dimension does not exist, the read dimension is created in the dimension table, the data value corresponding to the read dimension is written into a stored fact table, the association relation between the read dimension and the data value written into the fact table is built, and the step of reading another dimension from the multi-source heterogeneous data is executed until the dimension in the multi-source heterogeneous data is completely read;
and if the data values exist, inquiring whether the data values corresponding to the read dimensions in the fact table are the same, if so, executing a step of reading another dimension from the multi-source heterogeneous data until the dimensions in the multi-source heterogeneous data are completely read, if not, writing the data values corresponding to the read dimensions in the fact table, constructing an association relation between the read dimensions and the data values written in the fact table, and executing a step of reading another dimension from the multi-source heterogeneous data until the dimensions in the multi-source heterogeneous data are completely read.
With reference to the first aspect, the present invention provides a third possible implementation manner of the first aspect, where the building a data cube based on the dimension table and the fact table includes:
creating a database table;
generating dimension combinations contained in the dimension table based on the dimension table, and inquiring data corresponding to each dimension combination based on the fact table to form a cube of the dimension combination;
and converting the data in the cubes of all the dimension combinations into a format supported by the database table, and writing the data into the created database table to obtain the data cube.
With reference to the first aspect, the present invention provides a fourth possible implementation manner of the first aspect, where the building a data cube based on the dimension table and the fact table includes:
constructing a coordinate system of the data cube according to the dimension table;
and reading a data value corresponding to the dimension table from a fact table associated with the dimension table, and filling the data value to a coordinate point corresponding to the dimension table in the coordinate system to obtain the data cube.
With reference to the first aspect and any one of the first to the fourth possible implementation manners of the first aspect, the present invention provides a fifth possible implementation manner of the first aspect, where after the obtaining of the multi-source heterogeneous data, before performing cluster analysis on the multi-source heterogeneous data, the method further includes:
and preprocessing the acquired multi-source heterogeneous data.
With reference to the fifth possible implementation manner of the first aspect, the present invention provides a sixth possible implementation manner of the first aspect, wherein the preprocessing includes: deleting abnormal multi-source heterogeneous data, repairing missing values in the multi-source heterogeneous data, and performing format conversion on the multi-source heterogeneous data according to a preset format or any combination thereof.
In a second aspect, an embodiment of the present application provides a data query apparatus, including:
the dimensionality acquisition module is used for acquiring multi-source heterogeneous data and carrying out clustering analysis on the multi-source heterogeneous data to obtain dimensionality contained in the multi-source heterogeneous data and a dimensionality hierarchy corresponding to each dimensionality;
the fact table building module is used for building a dimension table based on the obtained dimensions and the dimension hierarchy corresponding to each dimension, and building a fact table associated with the dimension table based on the obtained multi-source heterogeneous data;
the data cube construction module is used for constructing a data cube based on the dimension table and the fact table;
the database construction module is used for creating a structured query language sentence corresponding to the dimension combination based on the dimension combination contained in the dimension table, acquiring a cube corresponding to the dimension combination from the constructed data cube, and storing the structured query language sentence and the cube into a database;
and the query module is used for receiving query information input by a user, converting the query information into structured query information, acquiring cube information corresponding to the structured query information from the database, and outputting the acquired cube information to the user.
In a third aspect, an embodiment of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, where the processor implements the steps of the method when executing the computer program.
In a fourth aspect, the present application provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, performs the steps of the method described above.
According to the data query method, the data query device, the electronic equipment and the computer readable storage medium, multi-source heterogeneous data are obtained, clustering analysis is carried out on the multi-source heterogeneous data, and dimensions contained in the multi-source heterogeneous data and a dimension level corresponding to each dimension are obtained; constructing a dimension table based on the obtained dimensions and the dimension level corresponding to each dimension, and constructing a fact table associated with the dimension table based on the obtained multi-source heterogeneous data; constructing a data cube based on the dimension table and the fact table; based on the dimension combination contained in the dimension table, creating a structured query language statement corresponding to the dimension combination, acquiring a cube corresponding to the dimension combination from the constructed data cube, and storing the structured query language statement and the cube into a database; receiving query information input by a user, converting the query information into structured query information, acquiring cube information corresponding to the structured query information from the database, and outputting the acquired cube information to the user. Therefore, the dimension table and the fact table are built based on the multi-source heterogeneous data, the data cube is built based on the dimension table and the fact table, each dimension combination is carried out according to the dimension table, each dimension combination is packaged into the structured query language statement, the query result corresponding to the packaged structured query language statement is obtained from the data cube and stored into the database, so that when in query, the database is queried according to the structured query information corresponding to the query request, the corresponding query result is obtained, massive multi-source heterogeneous data do not need to be queried according to the query request and are matched with the data to obtain the query result, and the query speed and the query efficiency can be effectively improved.
In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow chart of a data query method according to an embodiment of the present application. As shown in fig. 1, the method includes:
step 101, obtaining multi-source heterogeneous data, and performing clustering analysis on the multi-source heterogeneous data to obtain dimensions contained in the multi-source heterogeneous data and a dimension level corresponding to each dimension;
in the embodiment of the application, the multi-source heterogeneous data can be from a relational database or from Hadoop Hive. The Hadoop Hive is a data warehouse established on a Hadoop file system and supports structured data, semi-structured data and unstructured data. As another alternative, multi-source heterogeneous data can also be obtained by Kafka. The Kafka is used for obtaining multi-source heterogeneous data, and the Kafka can also be used for converting and preprocessing the obtained multi-source heterogeneous data.
In the embodiment of the application, after the multi-source heterogeneous data is subjected to cluster analysis, a data analysis topic set can be obtained, and the data analysis topic set comprises a plurality of data analysis topics. Each data analysis topic corresponds to one or more dimensions.
In this embodiment, as an optional embodiment, after the obtaining the multi-source heterogeneous data, before performing cluster analysis on the multi-source heterogeneous data, the method further includes:
and preprocessing the acquired multi-source heterogeneous data.
In the embodiment of the present application, as an optional embodiment, the preprocessing includes, but is not limited to: deleting abnormal multi-source heterogeneous data, repairing missing values in the multi-source heterogeneous data, performing format conversion on the multi-source heterogeneous data according to a preset format and the like.
In the embodiment of the application, data corresponding to abnormal data values in multi-source heterogeneous data are deleted. And determining whether the data value is abnormal or not can be obtained by judging according to a data abnormal value preset for the type of data. The method comprises the steps of classifying the acquired multi-source heterogeneous data by a machine learning method, analyzing the mass data corresponding to each classification, and setting a data abnormal value corresponding to the classified data by combining expert analysis.
In this embodiment, as an optional embodiment, the repairing of the missing value in the multi-source heterogeneous data may be performed by, after deleting abnormal data and data with empty data values in the classified data, counting a mean value of each data value of the remaining multi-source heterogeneous data, where the corresponding data value is empty mapping, and replacing the empty data value.
102, constructing a dimension table based on the obtained dimensions and the dimension level corresponding to each dimension, and constructing a fact table associated with the dimension table based on the obtained multi-source heterogeneous data;
in the embodiment of the application, as an optional embodiment, the dimension table is identified by a primary key, the fact table is identified by a foreign key, and the dimension table and the fact table are associated by the primary key and the foreign key.
In the embodiment of the present application, as an optional embodiment, a hash operation is performed on the dimension table to obtain the primary keyword, and a hash operation is performed on the fact table to obtain the foreign keyword.
In an embodiment of the present application, a dimension table includes one or more dimensions, and each dimension includes one or more dimension hierarchies. As an optional embodiment, the fact table includes a dimension level attribute column and a dimension level attribute value column, where the dimension level attribute corresponds to a dimension level, and the dimension level attribute value corresponds to a data value corresponding to the dimension level.
In this application embodiment, use the heterogeneous data of multisource that acquire as each model oil recovery equipment oil production data as the example, cluster based on the heterogeneous data of multisource that acquires, can obtain the three dimensions of oil recovery equipment model, region and time, set up table 1 and record each model oil recovery equipment in each region annual oil production condition. Table 2 records the model of the oil production equipment, table 3 records the region, table 4 records the time, and table 2, table 3 and table 4 are respectively associated with table 1 through external keys. Wherein, table 1 is a fact table, and tables 2, 3, and 4 are dimension tables. In the embodiment of the present application, table 2, table 3, and table 4 may be combined into one dimension table.
In this embodiment of the present application, as an optional embodiment, constructing a fact table associated with the dimension table based on the obtained multi-source heterogeneous data includes:
a01, constructing a star graph containing a dimension table and a fact table associated with the dimension table according to the multi-source heterogeneous data;
in the examples of the present application, the Star chart (Star chart) was constructed to include tables 1, 2, 3 and 4.
In the embodiment of the present application, taking the multi-source heterogeneous data as the mining well development month data as an example, it is assumed that the mining well development month data includes five dimensions, which are:
the organizational dimension (T _ CAL _ ORG), the geological unit dimension (T _ CAL _ ADDRUNIT), the time dimension (year, month T _ CAL _ DT), the production mode dimension (T _ CAL _ PROMODE), and the WELLBORE dimension (T _ CAL _ WELLBORE). Wherein,
the dimension hierarchy corresponding to the dimension of the organization mechanism comprises: organization number, organization name, organization abbreviation, father organization, etc.;
the dimension hierarchy corresponding to the geocellular dimension comprises: geocellular number, geocellular name, parent geocellular, type, etc.;
the dimension hierarchy corresponding to the time dimension comprises: year, month, day, quarter, etc.;
the dimension hierarchy corresponding to the extraction mode dimension comprises: a sampling mode number, a sampling mode name, and the like;
the dimension hierarchy corresponding to the wellbore dimension comprises: wellbore number, wellbore name, wellbore geographic location, wellbore type, etc. As shown in table 5.
TABLE 5
A fact table is constructed corresponding to Table 5, which is WELL (T _ WELL). Wherein,
attributes of the WELL (T _ WELL) include, but are not limited to: wellbore ID, geocellular ID, facility ID, geographic location, key well grade, date drilled, date completed, oil production, and time of statistics, among others.
In the embodiment of the present application, the dimensions in the Star diagram are defined by the following program code segments:
t _ WELL INNER JOIN T _ CAL _ ORG ON T _ well, organization ID ═ T _ CAL _ ORG, organization ID;
t _ WELL INNER JOIN T _ CAL _ wellbolt ON T _ well. WELLBORE ID ═ T _ CAL _ wellbolt. WELLBORE ID;
t _ WELL INNER JOIN T _ CAL _ addrontit ON T _ well. geocellular ID ═ T _ CAL _ addrontit. geocellular ID;
t _ WELL INNER JOIN T _ CAL _ DT ON MONTH (T _ well, date of production) MONTH (T _ CAL _ DT. date) and YEAR (T _ well, date of production) YEAR (T _ CAL _ DT. date);
t _ WELL INNER JOIN T _ CAL _ mode ON T _ well, extraction mode ID is T _ CAL _ mode.
Metrics that define fact tables: SUM (total net production), MIN (minimum net production), MAX (maximum net production), well ID, well name.
A02, creating a wide table by calling a data warehouse (Hive) according to the fact table and the dimension table of the star map;
in the embodiment of the application, whether the dimension is in the wide table is judged aiming at each dimension in the dimension table, and if the dimension is included, the processing is not carried out; if not, the dimension is added to the broad table.
In the embodiment of the present application, the program code segments for creating the wide table are as follows:
Create Intermediate Flat Hive Table
hive-e"DROP TABLE IF EXISTS Well_cube_desc_19700101000000_20140101000000_94eeff_f3ed_4c38_9e2f_b675f6cce23d;
CREATE EXTERNAL TABLE IF NOT EXISTS Well_cube_desc_19700101000000_20140101000000_94eeff_f3ed_4c38_9e2f_b675f6cce23d
(
DEFAULT _ WELL _ CATEG _ DT date- -time dimension
DEFAULT _ WELL _ CATEG _ ORGID string- -mechanism dimension
DEFAULT _ WELL _ CATEG _ WELLBORED string- -wellbore dimension
DEFAULT _ WELL _ CATEG _ ADDRUNITID string-address unit dimension
DEFAULT _ WELL _ CATEG _ PROMODEID string-sampling mode dimension
DEFAULT _ WELL _ CATEG _ WELLNAME string-WELL name
DEFAULT _ WELL _ CATEG _ WELLBORENME string-wellbore name
DEFAULT _ WELL _ WELL _ OUTPUT decimal (19,4) - - -yield
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY'\177'
STORED AS SEQUENCEFILE
LOCATION '/tmp/well-94eeff_f3ed_4c38_9e2f_b675f6cce23d/Well_cube_desc_19700101000000_20140101000000_94eeff_f3ed_4c38_9e2f_b675f6cce23d'
A03, removing the weight of the dimension column in the wide table;
in this embodiment, the deduplicated wide table may be stored in a Hadoop Distributed File System (HDFS). And performing deduplication on the same dimension column in the wide table, so that the deduplicated wide table comprises the following dimension columns:
SELECT DISTINCTION DEFELT _ WELL _ CATEG _ DT date- -time dimension
DEFAULT _ WELL _ CATEG _ ORGID string- -mechanism dimension
DEFAULT _ WELL _ CATEG _ WELLBORED string- -wellbore dimension
DEFAULT _ WELL _ CATEG _ ADDRUNITID string-address unit dimension
DEFAULT _ WELL _ CATEG _ PROMODEID string-sampling mode dimension
DEFAULT _ WELL _ CATEG _ WELLNAME string-WELL name
DEFAULT _ WELL _ CATEG _ WELLBORENME string-wellbore name
DEFAULT _ WELL _ WELL _ OUTPUT decimal (19,4) - - -yield
FROM T _ WELL INNER JOIN T _ CAL _ ORG ON T _ well, organization ID T _ CAL _ ORG
INNER JOIN T _ CAL _ WELLBORE ON T _ WELL. WELLBORE ID ═ T _ CAL _ WELLBORE. WELLBORE ID
INNER JOIN T _ CAL _ addrontit ON T _ well. geocellular ID T _ CAL _ addrontit
INNER JOIN T _ CAL _ DT ON MONTH (T _ well, date of operation) ═ MONTH (T _ CAL _ DT. date) and YEAR (T _ well, date of operation) ═ YEAR (T _ CAL _ DT. date)
The acquisition mode ID is T _ CAL _ mode
A04, creating a dimension dictionary according to the de-duplicated wide table, and associating the dimension dictionary with the fact table.
In the embodiment of the present application, as an optional embodiment, a mapping reduction (MapReduce) method is used, a deduplication value of a dimension is extracted from a fact table, compression coding is performed in a dictionary tree manner, and meanwhile, compression coding is performed on all dimension tables to generate a dimension dictionary, where the dimension dictionary is a dimension table including each dimension level.
In the embodiment of the present application, as an optional embodiment, the dimension dictionary is a Hive dictionary. The Hive is a data warehouse tool based on Hadoop, can map structured data into a database table, provides a complete structured query language function, and can convert a structured query language statement into a mapping reduction (MapReduce) task for operation.
In this embodiment of the present application, as another optional embodiment, constructing a fact table associated with the dimension table includes:
a11, reading a dimension and a data value corresponding to the dimension from multi-source heterogeneous data;
in the embodiment of the application, the multi-source heterogeneous data comprises a data name and a data value corresponding to the data name. The name of the data is the dimension to be read in the embodiment of the application.
A12, inquiring whether the read dimension exists in the stored dimension table;
in the embodiment of the application, whether the read dimension is matched with any dimension in the dimension table can be determined in a key word matching mode, and if the read dimension is matched with any dimension in the dimension table, the dimension is determined to exist in the dimension table.
A13, if not, creating the read dimension in a dimension table, writing the data value corresponding to the read dimension into a stored fact table, constructing the association relationship between the read dimension and the data value written into the fact table, and executing the step of reading another dimension from the multi-source heterogeneous data until the dimension in the multi-source heterogeneous data is completely read;
a14, if yes, inquiring whether a data value which is the same as the data value corresponding to the read dimension exists in the fact table, if yes, executing a step of reading another dimension from the multi-source heterogeneous data until the dimension in the multi-source heterogeneous data is completely read, if not, writing the data value corresponding to the read dimension into the fact table, constructing the association relationship between the read dimension and the data value written into the fact table, and executing a step of reading another dimension from the multi-source heterogeneous data until the dimension in the multi-source heterogeneous data is completely read.
103, constructing a data cube based on the dimension table and the fact table;
in this embodiment, as an optional embodiment, constructing a data cube based on the dimension table and the fact table includes:
a21, creating a database table;
in the embodiment of the application, the database Table is an HBase Table.
A22, generating dimension combinations contained in the dimension table based on the dimension table, and querying data corresponding to each dimension combination based on the fact table to form a cube of the dimension combination;
in this embodiment of the present application, generating a dimension combination included in the dimension table based on the dimension table includes:
calculating a basic cube based on the dimension table; and calculating secondary cubes based on the basic cubes, and obtaining the dimension combination contained in the dimension table according to the basic cubes and the calculated secondary cubes at all levels.
In the embodiment of the application, all dimension combinations in the dimension table are calculated and stored. Wherein, each dimension combination corresponds to a cube.
In the embodiment of the present application, if the number of dimensions included in the dimension table is N, the number of dimension combinations is N times of 2, and as an optional embodiment, a layer-by-layer algorithm is adopted to generate the dimension combinations. Wherein,
in the layer-by-layer algorithm, calculation is performed in a mode that the dimension number decreases layer by layer, all dimensions contained in the dimension table are aggregated to obtain a dimension combination of a first level, and for the dimension combinations of other levels, calculation is performed based on the dimension combination of the previous level of the level.
And A23, converting the data in the cube with all the dimension combinations into a format supported by a database table, and writing the format into the created database table to obtain the data cube.
In the embodiment of the application, data is converted into HFile and is imported into HBase Table.
In the embodiment of the present application, the program code segment for writing data into the database table is as follows:
INSERT OVERWRITE TABLE Well_cube_desc_19700101000000_20140101000000_94eeff_f3ed_4c38_9e2f_b675f6cce23d SELECT
FACT_TABLE.CATEG_DT
,FACT_TABLE.CATEG_ORGID
,FACT_TABLE.CATEG_WELLBOREID
FACT_TABLE.CATEG_ADDRUNITID
,FACT_TABLE.CATEG_PROMODEID
FACT_TABLE.CATEG_WELLNAME
,LOOKUP_2.CATEG_WELLBORENAME
,FACT_TABLE.WELL_OUTPUT
FROM DEFAULT.T_WELL as FACT_TABLE
INNER JOIN DEFAULT.T_CAL_DT as LOOKUP_1
ON FACT_TABLE.CATEG_DT=LOOKUP_1.CAL_DT
INNER JOIN DEFAULT.T_CAL_WELLBORE as LOOKUP_2
ON FACT_TABLE.CATEG_WELLBOREID=LOOKUP_2.CATEG_WELLBOREID
INNER JOIN T_CAL_ORG as LOOKUP_3
ON FACT_TABLE.CATEG_ORGID=LOOKUP_3.CATEG_ORGID
INNER JOIN T_CAL_ADDRUNIT as LOOKUP_4
ON FACT_TABLE.CATEG_ADDRUNIT ID=LOOKUP_4.CATEG_ADDRUNIT ID
INNER JOIN T_CAL_PROMODE as LOOKUP_5
ON FACT_TABLE.CATEG_PROMODE ID=LOOKUP_5.CATEG_PROMODEID
WHERE(FACT_TABLE.CATEG_DT<'2019-05-31')
in this embodiment, as another optional embodiment, building a data cube based on the dimension table and the fact table includes:
a31, constructing a coordinate system of the data cube according to the dimension table;
in the embodiment of the application, each dimension in the dimension table corresponds to a coordinate axis in the multidimensional coordinate system, and the dimension number included in the dimension table is the number of coordinate axes included in the coordinate system of the data cube.
A32, reading the data value corresponding to the dimension table from the fact table associated with the dimension table, and filling the data value to the coordinate point corresponding to the dimension table in the coordinate system to obtain the data cube.
In the embodiment of the application, the data value is a fact in the fact table, the data value corresponding to each dimension level information is obtained from the fact table according to the dimension level information of each dimension in the dimension table, and is added to the coordinate point corresponding to the dimension level information, so that the construction of the data cube is realized.
In the embodiment of the application, because the data cube is associated with the dimension table and the fact table, when the dimension in the dimension table or the numerical value in the fact table is updated, the coordinate point corresponding to the data cube is correspondingly dynamically updated.
104, establishing a structured query language statement corresponding to the dimension combination based on the dimension combination contained in the dimension table, acquiring a cube corresponding to the dimension combination from the constructed data cube, and storing the structured query language statement and the cube into a database;
in the embodiment of the present application, taking tables 1, 2, 3 and 4 as examples, in the data cube, time, types of oil production equipment and regions are dimensions, and oil production is a measure. For each model of oil recovery equipment, a cube (cuboid) is constructed according to the oil production of the model of oil recovery equipment in a first time period and a first region, for example, the oil production of M-type oil recovery equipment in Daqing is a cube in 2019.01.01-2019.06.30 time period, and all the cubes constructed form a data cube.
In the embodiment of the application, dimension combination is performed based on a dimension table, structured query language statements are packaged based on the dimension combination, query results corresponding to the packaged structured query language statements are obtained from a data cube or a star map and stored in a database, so that in the subsequent query, the data cube does not need to be queried according to a query request, and the database is directly queried, so that corresponding query results are obtained, namely, query SQL corresponding to each combination is set for possible combinations of dimensions in the dimension table, the SQL corresponding to the combination is processed in advance to obtain the query results and is persisted in the database, so that the query results persisted in the database can be reused in the actual query, the query speed is effectively increased, and the query efficiency is improved. In the embodiment of the present application, as an optional embodiment, a combination that all dimensions may exist may also be pre-calculated for data in the star map, and stored in the database as a subcube.
And 105, receiving query information input by a user, converting the query information into structured query information, acquiring cube information corresponding to the structured query information from the database, and outputting the acquired cube information to the user.
In the embodiment of the application, the data cube is transparent to user query, is mapped to the corresponding data cube by automatically analyzing the structured query language, and acquires the data of the coordinate point corresponding to the mapped data cube and outputs the data as a query result; if the mapping is not mapped to the corresponding data cube, the query result is obtained by querying the dimension table and the fact table or the data warehouse.
In the embodiment of the application, if well type dimension statistics of wells needs to be added subsequently, the well type dimensions can be adaptively added in the dimension table, blood margin analysis and feature analysis of fields contained in the dimension table are performed, for example, a Neo4j graph database is constructed according to multi-source heterogeneous data, relationships among metadata corresponding to the dimensions, entity attribute relationships, blood margin analysis of the fields, feature analysis and data access frequency analysis are performed based on the Neo4j graph database, and therefore cubes corresponding to the well type dimensions are automatically filled in data cubes. Data cubes are a type of multi-dimensional matrix that allow users to explore and analyze data from multiple dimensions.
In the embodiment of the present application, taking increasing well type dimensionality in the data of a production well development month as an example, the method includes 6 dimensionalities, which are respectively:
the organizational dimension (T _ CAL _ ORG), the geological unit dimension (T _ CAL _ ADDRUNIT), the time dimension (year, month T _ CAL _ DT), the production mode dimension (T _ CAL _ PROMODE), the WELLBORE dimension (T _ CAL _ WELLBORE), and the well type dimension (T _ CAL _ WELLBYLTYPE). As shown in table 6.
TABLE 6
Adding an attribute description corresponding to the WELL-type dimension in the fact table, namely the attributes of the WELL (T _ WELL), comprises the following steps: wellbore ID, geocellular ID, organization ID, well type ID, geographic location, key well grade, drill-in date, drill-out date, completion date, oil production, and statistical time.
In the Star diagram, a new dimension is defined:
t _ WELL INNER JOIN T _ CAL _ well ON T _ well, well ID T _ CAL _ well.
In the embodiment of the application, based on the database, the columns cached in the fact table, the oil production amount in each time period, the oil production amount in the preset time period and the like can be inquired.
Fig. 2 is a schematic structural diagram of a data query device according to an embodiment of the present application. As shown in fig. 2, the apparatus includes:
the dimension obtaining module 201 is configured to obtain multi-source heterogeneous data, perform cluster analysis on the multi-source heterogeneous data, and obtain dimensions included in the multi-source heterogeneous data and a dimension hierarchy corresponding to each dimension;
in an embodiment of the present application, the multi-source heterogeneous data includes: structured data, semi-structured data, and unstructured data.
A fact table construction module 202, configured to construct a dimension table based on the obtained dimensions and the dimension hierarchy corresponding to each dimension, and construct a fact table associated with the dimension table based on the obtained multi-source heterogeneous data;
in the embodiment of the application, as an optional embodiment, the dimension table is identified by a primary key, the fact table is identified by a foreign key, and the dimension table and the fact table are associated by the primary key and the foreign key.
In this embodiment of the present application, as an optional embodiment, the fact table constructing module constructs a fact table associated with the dimension table based on the obtained multi-source heterogeneous data, including:
constructing a star graph containing a dimension table and a fact table associated with the dimension table according to the multi-source heterogeneous data;
creating a wide table by calling a data warehouse according to the fact table and the dimension table of the star map;
removing the weight of the dimension column in the wide table;
and creating a dimension dictionary according to the de-duplicated wide table, and associating the dimension dictionary with the fact table.
In this embodiment, as another optional embodiment, the fact table constructing module constructs a fact table associated with the dimension table based on the obtained multi-source heterogeneous data, including:
reading a dimension and a data value corresponding to the dimension from multi-source heterogeneous data;
inquiring whether the read dimensionality exists in the stored dimensionality table;
if the dimension does not exist, the read dimension is created in the dimension table, the data value corresponding to the read dimension is written into a stored fact table, the association relation between the read dimension and the data value written into the fact table is built, and the step of reading another dimension from the multi-source heterogeneous data is executed until the dimension in the multi-source heterogeneous data is completely read;
and if the data values exist, inquiring whether the data values corresponding to the read dimensions in the fact table are the same, if so, executing a step of reading another dimension from the multi-source heterogeneous data until the dimensions in the multi-source heterogeneous data are completely read, if not, writing the data values corresponding to the read dimensions in the fact table, constructing an association relation between the read dimensions and the data values written in the fact table, and executing a step of reading another dimension from the multi-source heterogeneous data until the dimensions in the multi-source heterogeneous data are completely read.
A data cube construction module 203, configured to construct a data cube based on the dimension table and the fact table;
in this embodiment, as an optional embodiment, the data cube construction module 203 is specifically configured to:
creating a database table;
generating dimension combinations contained in the dimension table based on the dimension table, and inquiring data corresponding to each dimension combination based on the fact table to form a cube of the dimension combination;
and converting the data in the cubes of all the dimension combinations into a format supported by the database table, and writing the data into the created database table to obtain the data cube.
In this embodiment, as another optional embodiment, the data cube construction module 203 is specifically configured to:
constructing a coordinate system of the data cube according to the dimension table;
and reading a data value corresponding to the dimension table from a fact table associated with the dimension table, and filling the data value to a coordinate point corresponding to the dimension table in the coordinate system to obtain the data cube.
A database construction module 204, configured to create a structured query language statement corresponding to the dimension combination based on the dimension combination included in the dimension table, obtain a cube corresponding to the dimension combination from the constructed data cube, and store the structured query language statement and the cube in a database;
the query module 205 is configured to receive query information input by a user, convert the query information into structured query information, obtain cube information corresponding to the structured query information from the database, and output the obtained cube information to the user.
In this embodiment, as an optional embodiment, the apparatus further includes:
and the preprocessing module (not shown in the figure) is used for preprocessing the acquired multi-source heterogeneous data.
In this embodiment, as an optional embodiment, the preprocessing includes: deleting abnormal multi-source heterogeneous data, repairing missing values in the multi-source heterogeneous data, and performing format conversion on the multi-source heterogeneous data according to a preset format or any combination thereof.
As shown in fig. 3, an embodiment of the present application provides a computer device 300, configured to execute the data query method in fig. 1, where the device includes a memory 301, a processor 302, and a computer program stored in the memory 301 and executable on the processor 302, where the processor 302 implements the steps of the data query method when executing the computer program.
Specifically, the memory 301 and the processor 302 can be general-purpose memory and processor, and are not limited to specific examples, and the processor 302 can execute the data query method when executing the computer program stored in the memory 301.
Corresponding to the data query method in fig. 1, an embodiment of the present application further provides a computer-readable storage medium, on which a computer program is stored, and the computer program is executed by a processor to perform the steps of the data query method.
In particular, the storage medium can be a general-purpose storage medium, such as a removable disk, a hard disk, or the like, and when executed, the computer program on the storage medium can execute the data query method.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments provided in the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus once an item is defined in one figure, it need not be further defined and explained in subsequent figures, and moreover, the terms "first", "second", "third", etc. are used merely to distinguish one description from another and are not to be construed as indicating or implying relative importance.
Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the present disclosure, which should be construed in light of the above teachings. Are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.