CN106649754A - Method and system for testing data warehouse performance - Google Patents
Method and system for testing data warehouse performance Download PDFInfo
- Publication number
- CN106649754A CN106649754A CN201611220973.1A CN201611220973A CN106649754A CN 106649754 A CN106649754 A CN 106649754A CN 201611220973 A CN201611220973 A CN 201611220973A CN 106649754 A CN106649754 A CN 106649754A
- Authority
- CN
- China
- Prior art keywords
- data
- data warehouse
- jobs
- warehouse
- machining path
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Links
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Prevention of errors by analysis, debugging or testing of software
- G06F11/3604—Analysis of software for verifying properties of programs
- G06F11/3612—Analysis of software for verifying properties of programs by runtime analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/25—Integrating or interfacing systems involving database management systems
- G06F16/254—Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Software Systems (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a method and system for testing data warehouse performance. According to the method, by conducting detection and amendment on data types used by data in a data warehouse, storage space occupied by the data in the data warehouse is reduced, and thus the operating performance of the data warehouse is improved; by detecting distribution keys of physical tables, rank storing requirements of data tables, table partitions of the physical tables and table connection relations among the physical tables in which connection relations exist in the data warehouse, and conducting detection and amendment on ranked sentences of the physical tables, the operating speed of the data warehouse is increased, thus the purpose of conducting comprehensive performance testing and optimization on the data warehouse is achieved, and a practicable method is provided for the comprehensive performance testing to find out problems before the data warehouse is used online.
Description
Technical field
The application is related to data warehouse performance technical field of measurement and test, more particularly, it relates to a kind of data warehouse performance is surveyed
Method for testing and system.
Background technology
Data warehouse (Data Warehouse, DWH) is the collection for preserving the data produced in enterprise's production or R&D process
Close.Data warehouse is generally used for recording information data of the enterprise from past a certain time point to each current stage, and manager is led to
These data are crossed, quantitative analysis and prediction can be made to the development course of enterprise and future trend.Due to the work of data warehouse
Once causing certain data to enter after data warehouse with characteristic, generally will be retained for a long time, that is to say, that counting
It is usually inquiry operation according to the operation carried out in warehouse, modification and deletion action seldom, generally only need to regularly data loading
And refreshing.
Typically the data warehouse that a certain enterprise designs is being directed to the performance test of the data warehouse in prior art
Reach the standard grade after use, within the observation period of certain length, enterprise personnel carries out artificial test to the performance of the data warehouse,
In this process if it find that the problem for existing in a certain respect of the data warehouse cannot stand, then enter for this problem
The performance optimization of row local, lacks the method for carrying out comprehensive test for the performance of the data warehouse.
The content of the invention
To solve above-mentioned technical problem, the invention provides a kind of data warehouse performance method of testing and system, to realize
For the purpose of comprehensive test of the performance of data warehouse.
To realize above-mentioned technical purpose, following technical scheme is embodiments provided:
A kind of data warehouse performance method of testing, is applied to data warehouse, and the data warehouse performance method of testing includes:
Detect whether the data type that the data in the data warehouse are used matches with the size of data, if
It is then to be changed to the data type of unmatched data to meet the minimum data type of the call data storage;
The distribution key for detecting the physical table in the data warehouse chooses whether meet Even distribution entails, if it is not, then
According to the distribution key of the physical table being uniformly distributed in the principle selection data warehouse;
Detect whether the tables of data in the data warehouse meets ranks memory requirement, if it is not, then adjustment no enough row
The ranks storage mode of the tables of data of row memory requirement;
Detect whether the physical table in the data warehouse meets table zoning requirements, if it is, to meeting table subregion
Physical table carries out table subregion;
Detect and whether meet table connection request between the physical table that there is annexation in the data warehouse, if not,
Then the physical table to being unsatisfactory for table connection request is reconnected according to table connection request;
Detect whether an ordering statement of the physical table in the data warehouse meets sentence and select to require, if it is not, then right
Not meeting sentence selects an ordering statement of desired physical table to be corrected.
Optionally, whether an ordering statement of the physical table in the detection data warehouse meets sentence and selects to require,
If it is not, then an ordering statement of the physical table required not meeting sentence to select also includes after being corrected:
It is reverse to obtain all of All Jobs according to job dependence relation by the data output end of the data warehouse
Operation machining path;
Obtain operation average latency of the All Jobs machining path of each operation, operation average performance times and be
System resource averagely takes;
When the operation average latency, operation horizontal according to the All Jobs machining path of each operation for obtaining is performed both by
Between and current system resource averagely take, obtain the shortest path of each operation;
The operation average latency, operation average performance times and system resource according to the shortest path of each operation puts down
Take the optimization data warehouse.
Optionally, the data output end by the data warehouse, it is reverse to obtain all according to job dependence relation
The All Jobs machining path of operation includes:
By the data output end of the data warehouse, the work of last level that All Jobs data genaration needs is obtained
Industry;
The operation of last level needed according to All Jobs data genaration and job dependence relation, inversely obtain institute
There is the operation machining path of operation.
Optionally, the operation average latency of the All Jobs machining path for obtaining each operation, operation are average
Execution time and system resource averagely take including:
Operation stand-by period of All Jobs machining path of each operation in record predetermined period, the Job execution time and
System resource takes;
Operation stand-by period, Job execution time according to the All Jobs machining path of each operation in predetermined period,
System resource takes operation average latency, the operation average performance times of the All Jobs machining path for calculating each operation
Averagely take with system resource.
Optionally, the operation average latency of the shortest path according to each operation, operation average performance times
Averagely taking the optimization data warehouse with system resource includes:
The normal form requirement of data warehouse is reduced, increases data redundancy;
Operation is adjusted according to operation average latency of the shortest path of each operation and operation average performance times
Priority;
The concurrency for adjusting the data warehouse is averagely taken according to system resource.
A kind of data warehouse performance test system, including:
Data type detection module, for detecting the data warehouse in the data type that used of data and the number
Whether match according to size, if it is, the data type of unmatched data is changed to meet the call data storage most
Small data type;
Distribution key detection module, for detecting the data warehouse in the distribution key of physical table choose whether meet uniform
Spreading requirements, if it is not, then according to the distribution key of the physical table being uniformly distributed in the principle selection data warehouse;
Storage mode detection module, for detecting the data warehouse in tables of data whether meet ranks memory requirement,
If it is not, then adjustment is unsatisfactory for the ranks storage mode of the tables of data of ranks memory requirement;
Subarea detecting module, for detecting the data warehouse in physical table whether meet table zoning requirements, if it is,
Then the physical table to meeting table subregion carries out table subregion;
Table connection detection module, for detecting the data warehouse in exist and whether meet between the physical table of annexation
Table connection request, if it is not, then the physical table to being unsatisfactory for table connection request is reconnected according to table connection request;
Sequence detection module, for detecting the data warehouse in physical table an ordering statement whether meet sentence selection
Require, if it is not, then an ordering statement of the physical table required not meeting sentence to select is corrected.
Optionally, also include:
Machining path acquisition module, it is inverse according to job dependence relation for by the data output end of the data warehouse
To the All Jobs machining path for obtaining All Jobs;
Time-obtaining module, for obtaining the operation average latency of the All Jobs machining path of each operation, making
Industry average performance times and system resource averagely take;
Shortest path acquisition module, for impartial according to the operation horizontal of the All Jobs machining path of each operation for obtaining
Treat that time, operation average performance times and current system resource averagely take, obtain the shortest path of each operation;
Optimization module, during for being performed both by according to the operation average latency of the shortest path of each operation, operation horizontal
Between and system resource averagely take the optimization data warehouse.
Optionally, the machining path acquisition module includes:
Date Conversion Unit, needs for by the data output end of the data warehouse, obtaining All Jobs data genaration
The operation of last level wanted;
Backstepping unit, the operation of last level and job dependence for being needed according to All Jobs data genaration is closed
System, the reverse operation machining path for obtaining All Jobs.
Optionally, the time-obtaining module includes:
Recording unit, for record the All Jobs machining path of each operation in predetermined period the operation stand-by period,
Job execution time and system resource take;
Computing unit, for according to the operation stand-by period of the All Jobs machining path of each operation in predetermined period,
Job execution time, system resource take the operation average latency of the All Jobs machining path for calculating each operation, make
Industry average performance times and system resource averagely take.
Optionally, the optimization module includes:
Normal form unit, for reducing the normal form requirement of data warehouse, increases data redundancy;
Priority adjustment unit, for average according to the operation average latency of the shortest path of each operation and operation
The execution time adjusts the priority of operation;
Concurrency adjustment unit, for averagely taking the concurrency for adjusting the data warehouse according to system resource.
From above-mentioned technical proposal as can be seen that embodiments providing a kind of data warehouse performance method of testing and being
System, wherein, the data warehouse performance method of testing to the data type that the data in the data warehouse are used by entering
Row is detected and corrected, to reduce the data warehouse in data shared by memory space, lift the fortune of the data warehouse
Row performance;The table point of the distribution key of physical table, the ranks memory requirement of tables of data, physical table in by detecting the data warehouse
Area, there is table annexation between the physical table of annexation and an ordering statement of physical table is detected and corrected, with
The speed of service of the data warehouse is lifted, so as to realize that the mesh of all round properties test and optimization is carried out to the data warehouse
, find out problem and provide a feasible method using front carrying out all round properties test reaching the standard grade for the data warehouse.
Description of the drawings
In order to be illustrated more clearly that the embodiment of the present invention or technical scheme of the prior art, below will be to embodiment or existing
The accompanying drawing to be used needed for having technology description is briefly described, it should be apparent that, drawings in the following description are only this
Inventive embodiment, for those of ordinary skill in the art, on the premise of not paying creative work, can be with basis
The accompanying drawing of offer obtains other accompanying drawings.
A kind of schematic flow sheet of data warehouse performance method of testing that Fig. 1 is provided for one embodiment of the application;
A kind of flow process of data warehouse performance method of testing that Fig. 2 is provided for a preferred embodiment of the application is illustrated
Figure;
A kind of structural representation of data warehouse performance test system that Fig. 3 is provided for one embodiment of the application;
A kind of structural representation of data warehouse performance test system that Fig. 4 is provided for a preferred embodiment of the application
Figure.
Specific embodiment
Below in conjunction with the accompanying drawing in the embodiment of the present invention, the technical scheme in the embodiment of the present invention is carried out clear, complete
Site preparation is described, it is clear that described embodiment is only a part of embodiment of the invention, rather than the embodiment of whole.It is based on
Embodiment in the present invention, it is every other that those of ordinary skill in the art are obtained under the premise of creative work is not made
Embodiment, belongs to the scope of protection of the invention.
The embodiment of the present application provides a kind of data warehouse performance method of testing, as shown in figure 1, including:
S101:Detect whether the data type that the data in the data warehouse are used matches with the size of data,
If it is, the data type of unmatched data is changed to meet the minimum data type of the call data storage.
It should be noted that following during the selection of data type that used of the data in the data warehouse
Choosing the minimum data type of memory space occupancy as far as possible can greatly save data space, lift the data bins
The speed of service in storehouse.
Specifically, when the data in the data warehouse are stored, if a certain data simultaneously can be with data
When type char and varchar are stored, optimization is stored with varchar data type.This is because data type
Between char and varchar and nonexistence energy difference, but the data length of varchar is variable, and the data length of char
It is immutable, if the data that storage is stored with char data type, if the length of the data is less than can holding for char
The data length received, then vacant position with space character supplement, so these space characters will be taken needed for the data storage
Memory space beyond memory space, cause the waste of memory space, therefore optimize using data type varchar to the number
According to being stored.
For the data of numeric type, if the magnitude range of a certain data can (can with data type SMALLINT
Storage digital data scope -231-231- 1, take byte 2) stored, and if data data type BIGINT
(digital data scope -2 can be stored63-263- 1, take byte 2) when being stored, will result in a large amount of waves of memory space
Take.
In addition, the uniformity and normalization in order to keep data warehouse type as far as possible, and in order to the number
When being migrated according to the data in warehouse, the complexity of data conversion is reduced, the dtd--data type definition in data warehouse should not mistake
In complexity.
It is described to detect in the data warehouse in one embodiment of the application on the basis of above-described embodiment
The data type that data are used and the size of data whether match including:
The data type that data in the data warehouse are used is used as target type;
All data types of the memory space less than target type will be taken as set to be selected;
Judge with the presence or absence of the data type that disclosure satisfy that the call data storage in the set to be selected, if sentenced
The data of breaking are mismatched with the target type.
S102:The distribution key for detecting the physical table in the data warehouse chooses whether meet Even distribution entails, if
It is no, then according to the distribution key of the physical table being uniformly distributed in the principle selection data warehouse.
It should be noted that the distribution key of the physical table needs to meet to require:
Each physical table must show formulation distribution key, not allow to be set up using the mode of acquiescence distribution key;
Distribution key field is in principle 1, at most no more than 3;
The distribution key of associated physical table is as far as possible consistent;
For dimension table, code table, associate field should be used as distribution key;
Distribution key field can not perform update operations;
Avoid using random distribution strategy as far as possible, although its data distribution is uniform, but random distribution strategy always leads
Cause in query execution, data exchange among the nodes and migration affect execution performance.
S103:Detect whether the tables of data in the data warehouse meets ranks memory requirement, if it is not, then adjustment is discontented
The ranks storage mode of the tables of data of sufficient ranks memory requirement.
It should be noted that the tables of data in the data warehouse generally can be by the way of line storage
Stored.But if a certain tables of data can be stored in the way of column storage, then to this in the way of column storage
Tables of data carries out storing can save the internal memory shared when the tables of data is inquired about, so as to lift the operation of the data warehouse
Performance.So described ranks memory requirement is, and can be stored in the way of column storage in a certain tables of data, then with row
The mode of formula storage is stored to the tables of data.
One embodiment of the application is provided a kind of tables of data that can be stored in the way of column storage and is musted
The condition that must be met, including:
A) must add table using the table of column storage, principle upper table can only do insertion operation, seldom do deletion and more
Newly;
B) insertion operation in principle can not be very frequent;
C) under normal circumstances using row storage, only frequently inquiring about, and access module is only to access a small amount of field
In the case of, it is just preferential using row storage.
D) tables of data of the Column quantity more than 10.
S104:Detect whether the physical table in the data warehouse meets table zoning requirements, if it is, to meeting table point
The physical table in area carries out table subregion.
It should be noted that when a certain physical table in the data warehouse is very big, to it table multidomain treat-ment is carried out
It is divided into a lot of little and manageable parts, has carried out the physical table of table subregion when given query statement is performed, scanning is related
The data of partial data rather than whole physical tables, so as to improve query performance.Table subregion is for the management of data warehouse
It is helpful, such as it is easy to roll old data in data warehouse.
But not all physical table is suitable for doing table subregion, the physical table should be judged whether from the following aspects
Meet table zoning requirements:
A) whether table is sufficiently large
The fact that only very big, table was just adapted to do table subregion (generally, when the data record of a true table surpasses
When crossing 100,000,000, we are considered as it and meet this requirement).If having several hundred million records in a true table, logically
True table is divided into less subregion can improve performance.And table for only tens thousand of or the fact that less record, to point
The administration overhead that area is carried out in advance is by much larger than the performance improvement that can be obtained.
Wherein, true table refers to the physical table for needing constantly to carry out data renewal.
B) whether querying condition can match subregion condition
Check whether the WHERE conditions of query statement are consistent with the field for considering table subregion.For example, if most look into
Ask use date condition, then the date zoning design according to the moon or week is perhaps very useful, and if querying condition is more
It is territory of use's condition, it may be considered that true table is done the subregion of list type for territory of use.
C) whether data warehouse needs to roll historical data
The rolling demand of historical data is also the Consideration of zoning design.Such as, only need to retain in data warehouse
Go bimestrial data.If data carry out subregion according to the moon, two months data before can will be easily deleted, and
Nearest data are stored in the subregion in nearest month.
D) whether can uniformly be decoupled according to certain regular data
Data are uniformly decoupled rule as far as possible should be selected.If the data volume of each subregion storage is suitable, then inquiry
The improvement of performance will be related to the quantity of subregion.For example, a table is divided into 10 subregions, hits the inquiry of single subregion condition
Sweep table performance will than non-subregion in the case of it is high 10 times.
S105:Detect and whether meet table connection request between the physical table that there is annexation in the data warehouse, such as
Really no, then the physical table to being unsatisfactory for table connection request is reconnected according to table connection request.
Wherein, the table connection request is specifically included:
The quantity of multilist association does not exceed 15 in one SQL statement.
The type of attachment for producing cartesian product is prohibitted the use of (if CROSS JOIN or two tables of association are in associate field
It is upper to there is a large amount of repetition).
When multiple physical tables are associated, the less physical table of association results preferentially does table connection;
When big/big/little three physical tables are inline, it is to avoid first two big tables are carried out table connection, unless filterability is very
By force;
When big/little/little three tables are inline, preferentially two little tables are carried out table connection;
When two big physical tables are associated, two physical tables that there are a large amount of repetition records are prohibitted the use of to be closed
Connection, in order to avoid producing huge intermediate result because repeating record association, causes increasing substantially for disk occupation proportion;For example, if
The physical table table of the repetition record of 1,000,000 and the physical table association of the record of the repetition of 10,000, as a result can be up to 1,000,000 *
10000=10,000,000,000 records.
S106:Detect whether an ordering statement of the physical table in the data warehouse meets sentence and select to require, if not,
An ordering statement of the physical table for then requiring not meeting sentence to select is corrected.
It should be noted that the sentence selects to require to include:
A) ORDER BY an ordering statements should not be in the view used, in the view, an ordering statement can be ignored;
B) ORDER BY an ordering statements executory cost is very high, if other optional an ordering statements, it is to avoid use;
C) sorting operation should not be performed on big data set;
D) PartitI/On By an ordering statements inside realizes that needs sort the data, if data volume is more than hundred million ranks,
Realized using group by an ordering statements.
On the basis of above-described embodiment, in a preferred embodiment of the application, as shown in Fig. 2 detection institute
Whether an ordering statement for stating the physical table in data warehouse meets sentence selection requirement, if it is not, then selecting not meeting sentence
An ordering statement of the physical table of requirement also includes after being corrected:
S107:By the data output end of the data warehouse, according to job dependence relation, reverse acquisition All Jobs
All Jobs machining path.
It should be noted that should be noted operation during the reverse All Jobs machining path for obtaining All Jobs
The integrality of dependence, reasonability.The problem of Circular dependency is checked whether there is emphatically.For the Circular dependency for occurring, need
Further confirm that with developer, it is to avoid due to reprocessabilty caused by job dependence configuration error, cause the wave of computing resource
Take.
On the basis of above-described embodiment, in one embodiment of the application, the number by the data warehouse
According to output end, according to job dependence relation, the reverse All Jobs machining path for obtaining All Jobs includes:
S1071:By the data output end of the data warehouse, last layer that All Jobs data genaration needs is obtained
The operation of level;
S1072:The operation of last level needed according to All Jobs data genaration and job dependence relation, inversely
Obtain the operation machining path of All Jobs.
S108:When obtaining the operation average latency of the All Jobs machining path of each operation, operation horizontal and being performed both by
Between and system resource averagely take.
On the basis of above-described embodiment, in another preferred embodiment of the application, each operation of acquisition
Operation average latency of All Jobs machining path, operation average performance times and system resource averagely take including:
S1081:Operation stand-by period, the Job execution of the All Jobs machining path of each operation in record predetermined period
Time and system resource take;
Explanation is needed exist for, the resource such as system CPU and I/O disappears when the system resource occupancy refers to Job execution
Consumption.
S1082:According to operation stand-by period, the Job execution of the All Jobs machining path of each operation in predetermined period
Time, system resource take and calculate operation average latency of All Jobs machining path of each operation, operation and averagely hold
Row time and system resource averagely take.
It should be noted that the length of the predetermined period is typically chosen for one week, but it is also possible to elect as two weeks, January or
1 year etc..The application is not limited this, concrete depending on actual conditions.
S109:The operation average latency, operation according to the All Jobs machining path of each operation for obtaining is average
Execution time and current system resource averagely take, and obtain the shortest path of each operation;
S110:The operation average latency, operation average performance times and system according to the shortest path of each operation
Resource averagely takes the optimization data warehouse.
It is described according to each operation in another preferred embodiment of the application on the basis of above-described embodiment
Operation average latency of shortest path, operation average performance times and system resource averagely take the optimization data warehouse
Including:
S111:The normal form requirement of data warehouse is reduced, increases data redundancy.
The purpose for increasing data redundancy is to exchange subtracting for calculating time and I/O operation time for many occupancy of memory space
It is few, so as to improve the ageing of data processing.
S112:The operation average latency and operation average performance times adjustment work according to the shortest path of each operation
The priority of industry.
It should be noted that according to data warehouse rule general in the industry, job priority can be generally divided into A, B, C etc.
Different stage.Average latency longer job priority in critical path is improved, is shorten the average latency most simple
Single effectively method.After improving priority, there can be bigger priority to obtain related resource in the case of contention for resources.
S113:The concurrency for adjusting the data warehouse is averagely taken according to system resource.
It should be noted that during operation average performance times, with Job execution all kinds of resources of system busy extent
It is in close relations.If during Job execution, the resource utilization such as data warehouse CPU, I/O it is higher (>80%), then can be with
The concurrency (while the operation quantity for carrying out) of the appropriate reduction data warehouse, limits the requisition of resource.So can protect
Operation in card critical path, can obtain more resources, relative to can ensure that operation is faster finished.
Accordingly, the embodiment of the present application additionally provides a kind of data warehouse performance test system, as shown in figure 3, including:
Data type detection module 100, for detecting the data warehouse in the data type that used of data and institute
State whether size of data matches, if it is, the data type of unmatched data is changed to meet the call data storage
Minimum data type;
Distribution key detection module 200, for detecting the data warehouse in physical table distribution key choose whether meet
Even distribution entails, if it is not, then according to the distribution key of the physical table being uniformly distributed in the principle selection data warehouse;
Storage mode detection module 300, for detecting the data warehouse in tables of data whether meet ranks storage will
Ask, if it is not, then adjustment is unsatisfactory for the ranks storage mode of the tables of data of ranks memory requirement;
Subarea detecting module 400, for detecting the data warehouse in physical table whether meet table zoning requirements, if
It is that then the physical table to meeting table subregion carries out table subregion;
Table connection detection module 500, for detecting the data warehouse in whether exist between the physical table of annexation
Table connection request is met, if it is not, then the physical table to being unsatisfactory for table connection request is reconnected according to table connection request;
Sequence detection module 600, for detecting the data warehouse in an ordering statement of physical table whether meet sentence
Select to require, if it is not, then an ordering statement of the physical table required not meeting sentence to select is corrected.
It should be noted that what the data type detection module 100 performed is data type to the data warehouse
Detection, follow during the selection of the data type that the data in the data warehouse are used as far as possible choose storage
The minimum data type of space hold can greatly save data space, lift the speed of service of the data warehouse.
Specifically, when the data in the data warehouse are stored, if a certain data simultaneously can be with data
When type char and varchar are stored, optimization is stored with varchar data type.This is because data type
Between char and varchar and nonexistence energy difference, but the data length of varchar is variable, and the data length of char
It is immutable, if the data that storage is stored with char data type, if the length of the data is less than can holding for char
The data length received, then vacant position with space character supplement, so these space characters will be taken needed for the data storage
Memory space beyond memory space, cause the waste of memory space, therefore optimize using data type varchar to the number
According to being stored.
For the data of numeric type, if the magnitude range of a certain data can (can with data type SMALLINT
Storage digital data scope -231-231- 1, take byte 2) stored, and if data data type BIGINT
(digital data scope -2 can be stored63-263- 1, take byte 2) when being stored, will result in a large amount of waves of memory space
Take.
In addition, the uniformity and normalization in order to keep data warehouse type as far as possible, and in order to the number
When being migrated according to the data in warehouse, the complexity of data conversion is reduced, the dtd--data type definition in data warehouse should not mistake
In complexity.
It is described to detect in the data warehouse in one embodiment of the application on the basis of above-described embodiment
The data type that data are used and the size of data whether match including:
The data type that data in the data warehouse are used is used as target type;
All data types of the memory space less than target type will be taken as set to be selected;
Judge with the presence or absence of the data type that disclosure satisfy that the call data storage in the set to be selected, if sentenced
The data of breaking are mismatched with the target type.
The distribution key detection module 200 mainly follows following original when the distribution key to the physical table is detected
Then:
Each physical table must show formulation distribution key, not allow to be set up using the mode of acquiescence distribution key;
Distribution key field is in principle 1, at most no more than 3;
The distribution key of associated physical table is as far as possible consistent;
For dimension table, code table, associate field should be used as distribution key;
Distribution key field can not perform update operations;
Avoid using random distribution strategy as far as possible, although its data distribution is uniform, but random distribution strategy always leads
Cause in query execution, data exchange among the nodes and migration affect execution performance.
Also, it should be noted that the tables of data in the data warehouse can generally adopt the side of line storage
Formula is stored.But if a certain tables of data can by column storage in the way of be stored, then by column storage in the way of pair
The tables of data carries out storing can save the internal memory shared when the tables of data is inquired about, so as to lift the fortune of the data warehouse
Row performance.So described ranks memory requirement is, a certain tables of data can by column storage in the way of be stored, then with
The mode of column storage is stored to the tables of data.
One embodiment of the application is provided a kind of tables of data that can be stored in the way of column storage and is musted
The condition that must be met, including:
A) must add table using the table of column storage, principle upper table can only do insertion operation, seldom do deletion and more
Newly;
B) insertion operation in principle can not be very frequent;
C) under normal circumstances using row storage, only frequently inquiring about, and access module is only to access a small amount of field
In the case of, it is just preferential using row storage.
D) tables of data of the Column quantity more than 10.
When a certain physical table in the data warehouse is very big, it should using the Subarea detecting module 400 to it
Carry out table multidomain treat-ment and be divided into a lot of little and manageable parts, the physical table for having carried out table subregion is performing given inquiry language
During sentence, the data of relevant portion rather than the data of whole physical tables are scanned, so as to improve query performance.Table subregion is for data
The management in warehouse is also helpful, such as be easy to roll old data in data warehouse.
But not all physical table is suitable for doing table subregion, the physical table should be judged whether from the following aspects
Meet table zoning requirements:
A) whether table is sufficiently large
The fact that only very big, table was just adapted to do table subregion (generally, when the data record of a true table surpasses
When crossing 100,000,000, we are considered as it and meet this requirement).If having several hundred million records in a true table, logically
True table is divided into less subregion can improve performance.And table for only tens thousand of or the fact that less record, to point
The administration overhead that area is carried out in advance is by much larger than the performance improvement that can be obtained.
Wherein, true table refers to the physical table for needing constantly to carry out data renewal.
B) whether querying condition can match subregion condition
Check whether the WHERE conditions of query statement are consistent with the field for considering table subregion.For example, if most look into
Ask use date condition, then the date zoning design according to the moon or week is perhaps very useful, and if querying condition is more
It is territory of use's condition, it may be considered that true table is done the subregion of list type for territory of use.
C) whether data warehouse needs to roll historical data
The rolling demand of historical data is also the Consideration of zoning design.Such as, only need to retain in data warehouse
Go bimestrial data.If data carry out subregion according to the moon, two months data before can will be easily deleted, and
Nearest data are stored in the subregion in nearest month.
D) whether can uniformly be decoupled according to certain regular data
Data are uniformly decoupled rule as far as possible should be selected.If the data volume of each subregion storage is suitable, then inquiry
The improvement of performance will be related to the quantity of subregion.For example, a table is divided into 10 subregions, hits the inquiry of single subregion condition
Sweep table performance will than non-subregion in the case of it is high 10 times.
The table connection request that the table connection detection module 500 is followed when table connecting detection is carried out to the data warehouse
Specifically include:
The quantity of multilist association does not exceed 15 in one SQL statement.
The type of attachment for producing cartesian product is prohibitted the use of (if CROSS JOIN or two tables of association are in associate field
It is upper to there is a large amount of repetition).
When multiple physical tables are associated, the less physical table of association results preferentially does table connection;
When big/big/little three physical tables are inline, it is to avoid first two big tables are carried out table connection, unless filterability is very
By force;
When big/little/little three tables are inline, preferentially two little tables are carried out table connection;
When two big physical tables are associated, two physical tables that there are a large amount of repetition records are prohibitted the use of to be closed
Connection, in order to avoid producing huge intermediate result because repeating record association, causes increasing substantially for disk occupation proportion;For example, if
The physical table table of the repetition record of 1,000,000 and the physical table association of the record of the repetition of 10,000, as a result can be up to 1,000,000 *
10000=10,000,000,000 records.
The sentence detection that the sequence detection module 600 is followed is required to include:
A) ORDER BY an ordering statements should not be in the view used, in the view, an ordering statement can be ignored;
B) ORDER BY an ordering statements executory cost is very high, if other optional an ordering statements, it is to avoid use;
C) sorting operation should not be performed on big data set;
D) PartitI/On By an ordering statements inside realizes that needs sort the data, if data volume is more than hundred million ranks,
Realized using group by an ordering statements.
On the basis of above-described embodiment, in a preferred embodiment of the application, as shown in figure 4, the data bins
Storehouse Performance Test System also includes:
Machining path acquisition module 700, for by the data output end of the data warehouse, closing according to job dependence
System, the reverse All Jobs machining path for obtaining All Jobs;
Time-obtaining module 800, for obtain the All Jobs machining path of each operation the operation average latency,
Operation average performance times and system resource averagely take;
Shortest path acquisition module 900, for according to the operation horizontal of the All Jobs machining path of each operation for obtaining
Stand-by period, operation average performance times and current system resource averagely takes, and obtains the shortest path of each operation;
Optimization module 1000, for averagely being held according to the operation average latency of the shortest path of each operation, operation
Row time and system resource averagely take the optimization data warehouse.
It should be noted that should be noted operation during the reverse All Jobs machining path for obtaining All Jobs
The integrality of dependence, reasonability.The problem of Circular dependency is checked whether there is emphatically.For the Circular dependency for occurring, need
Further confirm that with developer, it is to avoid due to reprocessabilty caused by job dependence configuration error, cause the wave of computing resource
Take.
On the basis of above-described embodiment, in one embodiment of the application, the machining path acquisition module 700 is wrapped
Include:
Date Conversion Unit, needs for by the data output end of the data warehouse, obtaining All Jobs data genaration
The operation of last level wanted;
Backstepping unit, the operation of last level and job dependence for being needed according to All Jobs data genaration is closed
System, the reverse operation machining path for obtaining All Jobs.
On the basis of above-described embodiment, in another preferred embodiment of the application, the time-obtaining module 800
Including:
Recording unit, for record the All Jobs machining path of each operation in predetermined period the operation stand-by period,
Job execution time and system resource take;
Computing unit, for according to the operation stand-by period of the All Jobs machining path of each operation in predetermined period,
Job execution time, system resource take the operation average latency of the All Jobs machining path for calculating each operation, make
Industry average performance times and system resource averagely take.
Explanation is needed exist for, the resource such as system CPU and I/O disappears when the system resource occupancy refers to Job execution
Consumption.
The length of the predetermined period is typically chosen for one week, but it is also possible to elect two weeks, January or 1 year etc. as.The application
This is not limited, it is concrete depending on actual conditions.
On the basis of above-described embodiment, in another preferred embodiment of the application, the optimization module 1000 is wrapped
Include:
Normal form unit, for reducing the normal form requirement of data warehouse, increases data redundancy;
Priority adjustment unit, for average according to the operation average latency of the shortest path of each operation and operation
The execution time adjusts the priority of operation;
Concurrency adjustment unit, for averagely taking the concurrency for adjusting the data warehouse according to system resource.
It should be noted that the purpose for increasing data redundancy is to exchange calculating time and I/O for many occupancy of memory space
The reduction of operating time, so as to improve the ageing of data processing.
According to the rule that data warehouse is general in the industry, job priority can be generally divided into the different stages such as A, B, C.Improve
Average latency longer job priority in critical path, is the most simple effective method for shortening the average latency.
After improving priority, there can be bigger priority to obtain related resource in the case of contention for resources.
Also, it should be noted that during operation average performance times and Job execution all kinds of resources of system busy extent
It is in close relations.If during Job execution, the resource utilization such as data warehouse CPU, I/O it is higher (>80%), then can be with
The concurrency (while the operation quantity for carrying out) of the appropriate reduction data warehouse, limits the requisition of resource.So can protect
Operation in card critical path, can obtain more resources, relative to can ensure that operation is faster finished.
In sum, the embodiment of the present application provides a kind of data warehouse performance method of testing and system, wherein, the number
According to warehouse performance test methods by being detected and corrected to the data type that the data in the data warehouse are used, with
The memory space shared by the data in the data warehouse is reduced, the runnability of the data warehouse is lifted;By detection
The distribution key of physical table in the data warehouse, the ranks memory requirement of tables of data, the table subregion of physical table, there is annexation
Physical table between table annexation and an ordering statement of physical table be detected and corrected, to lift the data warehouse
The speed of service, be the data warehouse so as to realize that the purpose of all round properties test and optimization is carried out to the data warehouse
Find out problem there is provided a feasible method using front carrying out all round properties test reaching the standard grade.
Each embodiment is described by the way of progressive in this specification, and what each embodiment was stressed is and other
The difference of embodiment, between each embodiment identical similar portion mutually referring to.
The foregoing description of the disclosed embodiments, enables professional and technical personnel in the field to realize or using the present invention.
Various modifications to these embodiments will be apparent for those skilled in the art, as defined herein
General Principle can be realized in other embodiments without departing from the spirit or scope of the present invention.Therefore, the present invention
The embodiments shown herein is not intended to be limited to, and is to fit to and principles disclosed herein and features of novelty phase one
The most wide scope for causing.
Claims (10)
1. a kind of data warehouse performance method of testing, it is characterised in that be applied to data warehouse, the data warehouse performance test
Method includes:
Detect whether the data type that the data in the data warehouse are used matches with the size of data, if it is,
The data type of unmatched data is changed to meet the minimum data type of the call data storage;
The distribution key for detecting the physical table in the data warehouse chooses whether meet Even distribution entails, if it is not, then foundation
It is uniformly distributed the distribution key that principle chooses the physical table in the data warehouse;
Detect whether the tables of data in the data warehouse meets ranks memory requirement, if it is not, then adjustment is unsatisfactory for ranks and deposits
The ranks storage mode of the tables of data that storage is required;
Detect whether the physical table in the data warehouse meets table zoning requirements, if it is, the physics to meeting table subregion
Table carries out table subregion;
Detect and whether meet table connection request between the physical table that there is annexation in the data warehouse, if it is not, then right
The physical table for being unsatisfactory for table connection request is reconnected according to table connection request;
Detect whether an ordering statement of the physical table in the data warehouse meets sentence and select to require, if it is not, then to not being inconsistent
Closing sentence selects an ordering statement of desired physical table to be corrected.
2. data warehouse performance method of testing according to claim 1, it is characterised in that the detection data warehouse
In physical table an ordering statement whether meet sentence select require, if it is not, then to do not meet sentence select require physics
An ordering statement of table also includes after being corrected:
By the data output end of the data warehouse, according to job dependence relation, the reverse All Jobs for obtaining All Jobs
Machining path;
Obtain operation average latency, operation average performance times and the system money of the All Jobs machining path of each operation
Source averagely takes;
Operation average latency of All Jobs machining path, operation average performance times according to each operation for obtaining and
Current system resource averagely takes, and obtains the shortest path of each operation;
The operation average latency, operation average performance times and system resource according to the shortest path of each operation averagely accounts for
With the optimization data warehouse.
3. data warehouse performance method of testing according to claim 2, it is characterised in that described by the data warehouse
Data output end, according to job dependence relation, the reverse All Jobs machining path for obtaining All Jobs includes:
By the data output end of the data warehouse, the operation of last level that All Jobs data genaration needs is obtained;
The operation of last level needed according to All Jobs data genaration and job dependence relation, inversely obtain all works
The operation machining path of industry.
4. data warehouse performance method of testing according to claim 2, it is characterised in that the institute of each operation of acquisition
Operation average latency, operation average performance times and the system resource for having operation machining path averagely take including:
The operation stand-by period of the All Jobs machining path of each operation, Job execution time and system in record predetermined period
Resource occupation;
Operation stand-by period, Job execution time according to the All Jobs machining path of each operation in predetermined period, system
Resource occupation calculates operation average latency of the All Jobs machining path of each operation, operation average performance times and is
System resource averagely takes.
5. data warehouse performance method of testing according to claim 2, it is characterised in that it is described according to each operation most
Operation average latency of short path, operation average performance times and system resource averagely take the optimization data warehouse bag
Include:
The normal form requirement of data warehouse is reduced, increases data redundancy;
The preferential of operation is adjusted according to operation average latency of the shortest path of each operation and operation average performance times
Level;
The concurrency for adjusting the data warehouse is averagely taken according to system resource.
6. a kind of data warehouse performance test system, it is characterised in that include:
Data type detection module, for detecting the data warehouse in the data type that used of data it is big with the data
It is little whether to match, if it is, the data type of unmatched data is changed to meet the minimum number of the call data storage
According to type;
Distribution key detection module, for detecting the data warehouse in the distribution key of physical table choose and whether meet and be uniformly distributed
Require, if it is not, then according to the distribution key of the physical table being uniformly distributed in the principle selection data warehouse;
Storage mode detection module, for detecting the data warehouse in tables of data whether meet ranks memory requirement, if
No, then adjustment is unsatisfactory for the ranks storage mode of the tables of data of ranks memory requirement;
Subarea detecting module, for detecting the data warehouse in physical table whether meet table zoning requirements, if it is, right
Meeting the physical table of table subregion carries out table subregion;
Table connection detection module, for detecting the data warehouse in have table company whether is met between the physical table of annexation
Requirement is connect, if it is not, then the physical table to being unsatisfactory for table connection request is reconnected according to table connection request;
Sequence detection module, for detecting the data warehouse in physical table an ordering statement whether meet sentence select will
Ask, if it is not, then an ordering statement of the physical table required not meeting sentence to select is corrected.
7. data warehouse performance test system according to claim 6, it is characterised in that also include:
Machining path acquisition module, for by the data output end of the data warehouse, according to job dependence relation, inversely obtaining
Take the All Jobs machining path of All Jobs;
Time-obtaining module, for obtaining operation average latency, the operation horizontal of the All Jobs machining path of each operation
It is performed both by the time and system resource averagely takes;
Shortest path acquisition module, during for according to the operation average waiting of the All Jobs machining path of each operation for obtaining
Between, operation average performance times and current system resource averagely take, obtain the shortest path of each operation;
Optimization module, for according to the operation average latency of the shortest path of each operation, operation average performance times and
System resource averagely takes the optimization data warehouse.
8. data warehouse performance test system according to claim 7, it is characterised in that the machining path acquisition module
Including:
Date Conversion Unit, for by the data output end of the data warehouse, obtaining what All Jobs data genaration needed
The operation of last level;
Backstepping unit, for the operation of last level that needed according to All Jobs data genaration and job dependence relation,
The reverse operation machining path for obtaining All Jobs.
9. data warehouse performance test system according to claim 7, it is characterised in that the time-obtaining module bag
Include:
Recording unit, for recording operation stand-by period, the operation of the All Jobs machining path of each operation in predetermined period
Execution time and system resource take;
Computing unit, for according to operation stand-by period, the operation of the All Jobs machining path of each operation in predetermined period
Execution time, system resource take operation average latency, the operation horizontal of the All Jobs machining path for calculating each operation
It is performed both by the time and system resource averagely takes.
10. data warehouse performance test system according to claim 7, it is characterised in that the optimization module includes:
Normal form unit, for reducing the normal form requirement of data warehouse, increases data redundancy;
Priority adjustment unit, for being performed both by according to the operation average latency of the shortest path of each operation and operation horizontal
Time adjusts the priority of operation;
Concurrency adjustment unit, for averagely taking the concurrency for adjusting the data warehouse according to system resource.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611220973.1A CN106649754B (en) | 2016-12-26 | 2016-12-26 | Data warehouse performance test method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611220973.1A CN106649754B (en) | 2016-12-26 | 2016-12-26 | Data warehouse performance test method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106649754A true CN106649754A (en) | 2017-05-10 |
CN106649754B CN106649754B (en) | 2021-04-20 |
Family
ID=58827633
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611220973.1A Active CN106649754B (en) | 2016-12-26 | 2016-12-26 | Data warehouse performance test method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106649754B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112416752A (en) * | 2020-11-02 | 2021-02-26 | 四川新网银行股份有限公司 | ETL (extract transform load) layered test method based on data warehouse |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8019795B2 (en) * | 2007-12-05 | 2011-09-13 | Microsoft Corporation | Data warehouse test automation framework |
CN103390020A (en) * | 2012-05-10 | 2013-11-13 | 西门子公司 | Method and system for storing data in database |
-
2016
- 2016-12-26 CN CN201611220973.1A patent/CN106649754B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8019795B2 (en) * | 2007-12-05 | 2011-09-13 | Microsoft Corporation | Data warehouse test automation framework |
CN103390020A (en) * | 2012-05-10 | 2013-11-13 | 西门子公司 | Method and system for storing data in database |
Non-Patent Citations (3)
Title |
---|
1生有你: "SQL性能优化", 《HTTPS://WWW.CNBLOGS.COM/YSYN/P/5445064.HTML,博客园》 * |
刘南海等: "大数据时代运营商分析支撑域转型的实践与思考", 《电信科学》 * |
李小庆: "银行数据仓库性能的深度优化", 《中国金融电脑》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112416752A (en) * | 2020-11-02 | 2021-02-26 | 四川新网银行股份有限公司 | ETL (extract transform load) layered test method based on data warehouse |
CN112416752B (en) * | 2020-11-02 | 2023-06-06 | 四川新网银行股份有限公司 | Data warehouse ETL (extract-transform-load) layered test method |
Also Published As
Publication number | Publication date |
---|---|
CN106649754B (en) | 2021-04-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5758146A (en) | Method and apparatus for optimizing data retrieval using index scanning | |
US7676451B2 (en) | Selective database statistics recollection | |
US20190303876A1 (en) | Nearest known person directory function | |
CN105095413B (en) | A kind of method and device solving data skew | |
US20050131947A1 (en) | Data processing system and method | |
US6654756B1 (en) | Combination of mass storage sizer, comparator, OLTP user defined workload sizer, and design | |
WO2017183065A1 (en) | Device and method for tuning relational database | |
US8650224B2 (en) | Batching content management operations to facilitate efficient database interactions | |
CN105069134A (en) | A Method of Automatic Collection of Oracle Statistical Information | |
CN103984726A (en) | Local revision method for database execution plan | |
US11726975B2 (en) | Auto unload | |
CN100367278C (en) | Device and method for archiving and inquiry historical data | |
CN106485409A (en) | A kind of workload apparatus for evaluating and method | |
US6304871B1 (en) | Method and system for characterizing applications for use with databases having structured query language interfaces | |
CN108805458A (en) | A kind of enterprise technology Competitiveness Assessment method and device | |
US20060122963A1 (en) | System and method for performing a data uniqueness check in a sorted data set | |
CN106649754A (en) | Method and system for testing data warehouse performance | |
Kuhn et al. | Expert Indexing in Oracle Database 11g | |
Müller et al. | Efficient view maintenance for enterprise applications in columnar in-memory databases | |
Dvoretskyi et al. | Data Utility Assessment while Optimizing the Structure and Minimizing the Volume of a Distributed Database Node. | |
CN114969025A (en) | Data wide table optimization processing method, system, device and medium | |
JP2017010376A (en) | Mart-less verification support system and mart-less verification support method | |
Dean et al. | Performance optimization of the open xdmod datawarehouse | |
CN114184219B (en) | Metering device verification distribution method, metering device verification distribution system, storage medium and electronic equipment | |
US8620902B2 (en) | Computer program products and methods for query collection optimization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |