HK1132815B - Methods and apparatus for improving data warehouse performance - Google Patents
Description
Cross reference to related applications
Priority is claimed to U.S. provisional application No. 60/816,423, entitled "Methods and Apparatus for Improving Data Warehouse Performance," filed June 26, 2006, which is hereby incorporated by reference in its entirety.
Technical Field
The present invention relates generally to data formatting and analysis, and more particularly to a method and apparatus for improving data warehouse performance.
Background
The storage, retrieval, and processing of various types of data allows enterprises to develop new business plans and/or adjust existing business plans for optimal performance. For example, a wireless telephone company may manage a data warehouse that stores existing user information and/or new user information. By processing such collected data, the wireless company may identify a growth trend that suggests new market opportunities, and/or an overburdened market that requires remediation and/or additional service equipment. Analyzing business data stored in a data warehouse may enable companies to translate such data into business intelligence, better understand their customers, and/or make various management decisions based on empirical information rather than heuristics.
Non-commercial organizations may also analyze warehoused data to meet various planning objectives. For example, government and/or municipal departments may study population growth and/or population movement to better provision urban infrastructure and/or services for increased use. Similarly, such population trend analysis can show that allocated budgets should be shifted from less demanding urban projects to higher-priority projects closely tied to areas of high population growth.
The data warehouse may maintain copies of the data for subsequent analysis. The data may include demographic data, financial data, business data, and/or behavioral data, such as cable television subscriptions, home purchasing behavior, and/or broadcast program consumption. Data warehouses may be implemented in a variety of ways, including as relational databases, multidimensional databases, flat files, hierarchical databases, object databases, and so forth. Reports generated from the data warehouse are typically created to reveal specific indicators that are important to businesses, government entities, and/or other groups. Because report generation consumes a finite amount of processing and storage resources, data warehouse performance may degrade as the size of the stored data increases.
Moreover, if multiple clients seek reports from a particular data warehouse at overlapping times, the reduced performance may result in unsatisfactory client latency and/or an inability to run queries in a manner that responds to the clients' rapidly shifting demands. For example, some data warehouses may require many hours of processing time to generate reports for customers. If a customer then selects another set of parameters for reporting, the customer must again wait a significant amount of time for the query to run on the overloaded data warehouse. At other times, depending on the throughput of the data warehouse, the processing time may be shorter; such variability makes it difficult for the data warehouse manager to efficiently allocate and/or manage data warehouse processing time across multiple clients.
Drawings
FIG. 1 is a block diagram illustrating an example of a data warehouse system.
FIG. 2 is a block diagram of the example data warehouse optimizer of FIG. 1.
FIG. 3 is a flow diagram representative of example machine readable instructions that may be executed to implement the example data warehouse optimizer of FIGS. 1 and 2.
FIG. 4 is a flow diagram representative of example machine-readable instructions that may be executed to obtain data for the example instructions of FIG. 3.
FIG. 5 is a flow diagram representative of example machine-readable instructions that may be executed to process data in the example instructions of FIG. 3.
FIG. 6 is a block diagram of an example data formatter to improve data warehouse performance.
FIG. 7 is an example data warehouse architecture and corresponding hierarchical directory paths for use in the example system of FIG. 1.
FIG. 8 is a flowchart representative of example machine readable instructions that may be executed to organize the example hierarchical directory path of FIG. 7.
FIG. 9 is a block diagram of another example data warehouse optimizer.
FIG. 10 is a block diagram of an example processor system that may be used to execute the example machine readable instructions of FIGS. 3-5 and/or 8 to implement the example systems and/or methods described herein.
Detailed Description
A data warehouse collects a large amount of information about one or more topics of interest, such as consumption habits of consumers and/or businesses, transactional data (e.g., financial market transactions, real estate transactions, etc.), population growth information, and/or multimedia broadcast viewing behavior, to name a few. Enterprises that maintain and control data warehouse facilities and/or provide data storage, processing, analysis, and/or report generation services face challenges created by the growth of these data collections. In general, as data warehouse storage capacity is consumed, additional processing power and/or processing cycles are required to analyze the increasing amount of collected data.
As used herein, data analysis (processing) may include, but is not limited to, searching for trends among collected data parameters, performing statistical functions on the collected data, accumulating computational results, and/or generating reports reflecting query results for review by an end user. The data processing, analysis, and/or storage functions of the data warehouse compete for processing resources, increasingly burdening the data warehouse system as the data collection grows.
Data warehouse enterprises that provide such storage and/or processing services to customers sometimes attempt to manage customer expectations by adding processors to computer servers, adding storage space to data warehouse systems, and/or implementing data warehouse subsystems. In addition to changing and/or updating the data warehouse infrastructure to improve performance, these data warehouse businesses may also give their customers advance notice of the turnaround time needed to complete a report, to ensure that the customers are not disappointed. If a customer knows in advance how long it will take to receive a specified report, the customer's expectations may align with reality, thereby increasing customer satisfaction. However, lengthy report generation processing times, whether or not the customer is notified in advance, may still lead to customer dissatisfaction. Further, the report generation processing time may depend on the utilization rate at different times of day, which may cause unexpected fluctuations in the expected processing time, making it difficult to provide accurate advance notice to the client.
To accommodate different customers, a data warehouse enterprise may use a number of dedicated data stacks (datamarts). As used herein, a "dedicated data stack" is one or more smaller databases focused on one or more particular services and/or topics. However, although a dedicated data stack is smaller than the aggregate data warehouse, it may still grow to a large scale, thereby negatively impacting report generation and other database performance characteristics.
The example methods and apparatus described herein may be used to improve data warehouse performance. In general, the example methods and apparatus illustrated herein may be used locally by a data warehouse and/or provided as an Application Programming Interface (API) to one or more clients to run on one or more personal computers, servers, and/or other computing platforms. The API may include a Graphical User Interface (GUI), and/or the data warehouse may provide comprehensive solutions to their customers in response to simple web-based report requests.
FIG. 1 is a block diagram of an example data warehouse system 100 constructed in accordance with the teachings of the present disclosure. The example system 100 includes data analysis instructions 102 that are fed to an input of a data warehouse optimizer 104. The data analysis instructions 102 may be constructed, for example, as an extensible markup language (XML) file containing details about the client's objectives. The data analysis instructions 102 may be written in any computer language and/or may take any alternative form, such as, without limitation, a text file, a character-delimited file, and/or a binary file that may be parsed into specific data parameters, statistical methods, and/or report format parameters. For example, a national broadcaster client may request a report containing ratings data for broadcast programs, news events, and/or commercials. In addition, the broadcaster client may need statistically separate sub-reports to determine, for example, which age group of viewers watched one or more particular broadcast programs. Also, the broadcaster client may require such statistically segregated sub-reports to separate ratings information by geographic boundaries such as country, region, state, county, and/or city. One of ordinary skill in the art will appreciate that the data stored and/or processed by the data warehouse is not limited to ratings information and may include any type of information. For example, data warehouse information may include, but is not limited to: financial transaction information, demographic information, subscription service consumption information, medical information, usage and/or supply of prescription drugs, sales information (e.g., point-of-sale information), weather information, and/or behavioral information.
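As a minimal, non-authoritative sketch, XML-based data analysis instructions of this kind might be expressed and parsed as follows; the element and attribute names (`analysis`, `report`, `parameter`, `statistic`) are invented for illustration and are not taken from the patent:

```python
import xml.etree.ElementTree as ET

# Hypothetical instruction file in the spirit of data analysis
# instructions 102; the schema below is illustrative only.
INSTRUCTIONS = """
<analysis>
  <report format="spreadsheet"/>
  <parameter name="program_type" value="sitcom"/>
  <parameter name="demographic" value="retirees"/>
  <statistic type="average_rating" group_by="state"/>
</analysis>
"""

def parse_instructions(xml_text):
    """Split an instruction file into parameters, statistics, and a report format."""
    root = ET.fromstring(xml_text)
    params = {p.get("name"): p.get("value") for p in root.iter("parameter")}
    stats = [(s.get("type"), s.get("group_by")) for s in root.iter("statistic")]
    report = root.find("report").get("format")
    return params, stats, report

params, stats, report = parse_instructions(INSTRUCTIONS)
print(params)   # {'program_type': 'sitcom', 'demographic': 'retirees'}
```

A parsed structure like this is what the data retriever and analyzer described below would consume.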
As discussed in further detail below, the data warehouse optimizer 104 of the illustrated example communicates with one or more data warehouses 106, 108, 110. Each data warehouse 106, 108, 110 may include one or more databases 112 to store one or more types of data. In the illustrated example, one data warehouse 106 is communicatively connected directly to the data warehouse optimizer 104. Direct connection to the data warehouse may be accomplished through one or more connections and/or communication protocols including, but not limited to, the Universal Serial Bus (USB) communication protocol and/or the FireWire communication protocol (i.e., IEEE-1394). Further, the data warehouse optimizer 104 and the data warehouses 106, 108, 110 of the illustrated example are communicatively coupled via a network 114, such as an intranet and/or the Internet. Suitable network 114 connections include, but are not limited to: the Ethernet communication protocol, IEEE-802.11 (Wi-Fi), Bluetooth, 900 MHz, 1.6 GHz, and/or mobile communication protocols (e.g., CDMA, TDMA, GSM, AMPS, EDGE, etc.).
The data warehouse optimizer 104 of the illustrated example receives the data analysis instructions 102 and extracts relevant data from one or more data warehouses 106, 108, 110 in response to the analysis instructions 102. Rather than burdening each data warehouse 106, 108, 110 with data queries (fetches) and statistical processing, the data warehouse optimizer 104 of the illustrated example extracts (e.g., copies) specific data responsive to a given set of instructions 102 from one or more data warehouses 106, 108, 110 and stores the extracted data in temporary storage 116. In this way, the data warehouse optimizer 104 relieves the data warehouses 106, 108, 110 of the duty of performing complex statistical data processing, allowing the data warehouses 106, 108, 110 to focus on the tasks of storing data, archiving data, and providing data in response to requests. The memory 116 may be implemented by mass storage optical, magnetic, and/or solid state memory, and may be used to store the executable APIs described above and/or data collected from the data warehouses 106, 108, 110 to be used to build building block statistics, as discussed in further detail below.
Although only one optimizer 104 is shown in FIG. 1, one of ordinary skill in the art will appreciate that multiple optimizers 104 may be used. The data warehouse optimizer 104 of the illustrated example is implemented on a computer, such as a Personal Computer (PC). Those skilled in the art will appreciate that the cost of computers and/or computer hardware has decreased, making personal computers a cost-effective alternative for specialized processing tasks. Additionally or alternatively, the data warehouse optimizer 104 may be implemented on a server that includes one or more hard disk drives to store the collected data received from the data warehouses 106, 108, 110. The personal computer and/or server implementing the optimizer 104 may be located at or near a data warehouse business facility to process client requests. Alternatively, the personal computer and/or server implementing the optimizer 104 may belong to one or more clients. Thus, for example, the data warehouse optimizer 104 may execute as an API on a client machine to create and/or process the data analysis instructions 102. Such an API may be sold or licensed to the customer, and/or the data warehouse company may charge the customer a per-transaction fee for use of the API.
An example data warehouse optimizer 104 is shown in more detail in FIG. 2. In the illustrated example, the data warehouse optimizer 104 includes a communication interface 202, a data retriever 204, an analyzer 206, and a report generator 208. Further, the example data retriever 204 includes a Structured Query Language (SQL) generator 210 and a scheduler 212. As described above, the communication interface 202 enables communication over a dedicated connection, a direct connection, and/or a network 114 such as an intranet or the Internet. Communication to/from the example data warehouse optimizer 104 may occur through a web page (e.g., a dynamic server page), a command line user interface, a graphical user interface, and/or a kiosk. The communication interface 202 may include one or more protective measures (e.g., firewalls) to protect the data warehouse optimizer 104 from unauthorized use and/or tampering.
Data analysis instructions 102, such as XML files, are received via the communication interface 202 and provided to the data retriever 204. The data analysis instructions 102 may contain one or more parameters, attributes, statistics, and/or formatting instructions for the requested report. For example, the data analysis instructions 102 may specify the type of data of interest, the particular time period for which data should be extracted from the data warehouse (e.g., a day, week, month, etc.), a particular broadcast program type (e.g., sports events, movies, series, etc.), and/or a particular population (e.g., children, teenagers, adults, senior citizens, Mexican-Americans, Polish-Americans, etc.). The data analysis instructions 102 may also specify the particular data warehouse 106, 108, 110 from which data is to be retrieved.
Many data warehouses may contain private, confidential (e.g., attorney/client communications, social security numbers, bank account numbers, etc.), and/or business-sensitive data. Thus, the data analysis instructions 102 may also contain one or more verification parameters to control connection to and/or access of the entire data warehouse or particular aspects of the data warehouse. Accordingly, the data retriever 204 of the illustrated example may invoke the communication interface 202 to establish a connection for data transfer. Verification between the data warehouse optimizer 104 and the data warehouses 106, 108, 110 may include, but is not limited to: Secure Sockets Layer (SSL), digital certificates, password protection, encryption, and/or public key cryptography.
Once the example data warehouse optimizer 104 connects to a particular data warehouse 106, 108, 110, the SQL generator 210 may create a query command for the requested data identified in the data analysis instructions 102. In some examples, the instructions 102 may list "sitcom" as a data parameter for which the data warehouse optimizer 104 analyzes ratings. Similarly, the instructions 102 may list "retirees" as another data parameter for which the data warehouse optimizer 104 analyzes viewership ratings. However, the parameter terminology may not match exactly between the data analysis instructions 102 and one or more of the data warehouses 106, 108, 110. In that case, the data retriever 204 may resolve any inconsistencies between one or more of the data warehouses 106, 108, 110 and the analysis instructions 102 using, for example, metadata. Thus, for example, when the example data analysis instructions 102 list "sitcom," the data warehouse may instead use the term "situation comedy." Metadata maps terms between one or more synonymous nouns such that a lack of exact agreement between the data warehouses 106, 108, 110 and the analysis instructions 102 does not hinder data processing. The data retriever of the illustrated example accomplishes metadata resolution by associating frequently used alternative terms with the terms employed by a particular set of analysis instructions 102 (e.g., an XML file). Alternatively, the data analysis instructions may include any number of term aliases to associate with the data warehouse terms, such that if a first attempt to match an analysis parameter with a data warehouse parameter fails, an alternate term from the analysis instructions may be attempted.
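A minimal sketch of such metadata-based term resolution follows; the alias table and vocabulary below are invented for illustration, not drawn from any actual warehouse schema:

```python
# Hypothetical alias metadata: instruction terms mapped to warehouse synonyms.
TERM_ALIASES = {
    "sitcom": ["situation comedy", "comedy series"],
    "retirees": ["senior citizens", "seniors"],
}

def resolve_term(instruction_term, warehouse_terms):
    """Map an instruction term onto whichever synonym the warehouse uses."""
    if instruction_term in warehouse_terms:
        return instruction_term                    # try an exact match first
    for alias in TERM_ALIASES.get(instruction_term, []):
        if alias in warehouse_terms:               # then fall back to aliases
            return alias
    raise KeyError(f"no warehouse term matches {instruction_term!r}")

warehouse_vocab = {"situation comedy", "drama", "senior citizens"}
print(resolve_term("sitcom", warehouse_vocab))   # situation comedy
```

The exact-match-then-alias order mirrors the fallback behavior described above: an alternate term is tried only after the first match attempt fails.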
In other examples, the data retriever 204 resolves the data analysis instruction values into encoded values employed by one or more data warehouses. Coded values may be used instead of exact values, for example, to save memory. For example, the data warehouse may use an integer range from 1 to 4 to represent four different income levels. A value of "1" may represent an income bracket between $10,000 and $20,000; a value of "2" may represent an income bracket between $20,000 and $40,000; a value of "3" may represent an income bracket between $40,000 and $70,000; and a value of "4" may represent an income bracket over $70,000. In this way, after retrieving entries corresponding to the particular values 1 through 4, the data retriever 204 may resolve (i.e., translate) any data received from the warehouse into an income bracket with respect to the appropriate income thresholds. Thus, for example, if the data analysis instructions request an income bracket between $20,000 and $40,000, the data retriever 204 in this example will generate an SQL query instruction that uses the associated value "2," which is understood by the target data warehouse as a valid query input term.
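The bracket-to-code translation can be sketched as follows. The brackets follow the example in the text, while the table name (`viewers`) and column name (`income_code`) are hypothetical:

```python
# Coded income brackets, per the four-level example in the text.
INCOME_CODES = {
    1: (10_000, 20_000),
    2: (20_000, 40_000),
    3: (40_000, 70_000),
    4: (70_000, None),        # None = no upper bound (over $70,000)
}

def code_for_income(low, high):
    """Find the warehouse code whose bracket matches the requested range."""
    for code, (lo, hi) in INCOME_CODES.items():
        if lo == low and hi == high:
            return code
    raise ValueError("no code matches the requested bracket")

def income_query(low, high):
    """Build an SQL query using the coded value instead of raw thresholds."""
    code = code_for_income(low, high)
    return f"SELECT * FROM viewers WHERE income_code = {code}"

print(income_query(20_000, 40_000))
# SELECT * FROM viewers WHERE income_code = 2
```

As in the example above, a request for the $20,000-$40,000 bracket is issued against the warehouse as the coded value "2".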
In the illustrated example, the scheduler 212 can be configured to periodically and/or aperiodically copy, from one or more data warehouses into the memory 116, some or all of the data involved in the data analysis instructions 102 and/or in constructing building block statistics. Thus, while the preceding examples show the data retriever 204 retrieving data from the data warehouse in response to receiving the data analysis instructions 102, relevant data warehouse information (e.g., one or more parameters, attributes, statistics, and/or formatting instructions) may already be stored in the memory 116 as a result of a previous periodic, aperiodic, and/or manual data retrieval invoked by the scheduler 212. Thus, it may not be necessary to query one or more data warehouses in response to a given instruction set 102.
As discussed above, the data warehouse may encounter periods of heavy usage, such as when a majority of the nation may be watching important events like the Super Bowl or the Olympic games. The scheduler 212 may retrieve relevant data from the data warehouse (e.g., make a copy of the relevant portion of the database) during periods when such heavy usage is unlikely to occur, thereby ensuring faster query response times due to less burdened data warehouse processors. For example, the activity of the data warehouse may be low during early morning hours. Thus, the scheduler 212 may be configured to acquire (e.g., make copies of) one or more portions of the data warehouse each day at, for example, 2:00 a.m.
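One way a scheduler like 212 might compute its next off-peak retrieval time is sketched below, assuming the daily 2:00 a.m. schedule from the example; this is illustrative only:

```python
from datetime import datetime, timedelta

def next_fetch_time(now, hour=2):
    """Return the next daily fetch time at the given off-peak hour."""
    candidate = now.replace(hour=hour, minute=0, second=0, microsecond=0)
    if candidate <= now:           # today's 2:00 a.m. has already passed
        candidate += timedelta(days=1)
    return candidate

now = datetime(2006, 6, 26, 14, 30)
print(next_fetch_time(now))   # 2006-06-27 02:00:00
```

A scheduler loop would sleep until this time, copy the relevant warehouse portions into the memory 116, and then recompute the next fetch time.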
After completing the extraction of data from one or more of the data warehouses 106, 108, 110 and storage to the memory 116, the data warehouse optimizer 104 may invoke the analyzer 206 to process the data with respect to the data analysis instructions 102. Data analysis may be performed by the data warehouse optimizer 104 rather than being invoked by SQL commands/instructions to be executed by one or more processors of the data warehouses 106, 108, 110. The optimizer 104 may be programmed in any desired computer language. One of ordinary skill in the art will appreciate that Java programs are platform-independent and may execute faster than other programs implemented in SQL scripts. Thus, Java is particularly well suited for implementing some or all of optimizer 104.
The data warehouse optimizer 104 of the illustrated example is particularly suited to Business Intelligence (BI) reporting applications. The optimizer 104 of the illustrated example is capable of performing statistical calculations directly on the databases 112 of the data warehouses 106, 108, 110 and/or on data in the memory 116. As the reporting requirements of the data analysis instructions 102 become more complex and the data sets of the databases 112 grow large, SQL-based approaches become increasingly inefficient and cumbersome. Unlike the simple SQL extract commands employed by the data warehouse optimizer 104, SQL-based computations typically require multiple SQL passes and temporary tables, which are relatively slow to execute. Furthermore, because the data extracted from the data warehouses 106, 108, 110 is focused on the customer-specific goals set forth in the data analysis instructions 102, subsequent processing of the data by the data warehouse optimizer 104 proceeds faster because there is less non-relevant data to sort through.
The analyzer 206 of the illustrated example begins data analysis by processing building block statistics (also referred to herein as primary statistics). In general, the primary/building block statistics may include one or more calculations based on data directly available from the data warehouses 106, 108, 110. Composite statistics (also referred to herein as secondary statistics), on the other hand, may be computed based in part on the results of the primary statistics and/or on additional data from the data warehouses 106, 108, 110. The building block statistics may be generated by a multi-stage process; for example, one or more preliminary stages of grouping and organization enable subsequent composite statistical calculations. The building block statistics may include, but are not limited to: the total number of viewers watching a particular broadcast program, a histogram of a broadcast program's ratings across the country, the average rating of a broadcast program in each state, etc. As discussed above, while the examples shown herein are particularly directed to broadcast programming and/or ratings, the systems, methods, and apparatus described herein may be applied to any type of data warehouse subject matter without limitation. The building block statistics may also include discrete buckets of information and a classification of one or more subgroups contained therein. For example, a "demographic" type bucket may be calculated from the collected data. For any particular data sample size, such as a day, several days, weeks, and/or months, the demographic bucket may contain different demographic subgroups. For example, a data sample taken during a weekend may include a demographic bucket with a relatively larger percentage of adult males watching television programs during the afternoon hours.
As another example, a data sample during a weekday might include a demographic bucket with a relatively larger proportion of women aged 40-60 watching television programs during the afternoon hours.
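A minimal sketch of such a "demographic" building-block bucket follows: counting viewing records by demographic subgroup within one daypart. The field names and sample records are invented for illustration:

```python
from collections import Counter

# Hypothetical viewing records for one sample window.
records = [
    {"demo": "adult male",  "daypart": "afternoon"},
    {"demo": "adult male",  "daypart": "afternoon"},
    {"demo": "women 40-60", "daypart": "afternoon"},
    {"demo": "teen",        "daypart": "evening"},
]

def demographic_bucket(records, daypart):
    """Count viewers per demographic subgroup within one daypart."""
    return Counter(r["demo"] for r in records if r["daypart"] == daypart)

bucket = demographic_bucket(records, "afternoon")
print(bucket.most_common(1))   # [('adult male', 2)]
```

Buckets like this one are the kind of directly-computed primary result that later composite statistics can build upon.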
The metadata also allows the data warehouse optimizer 104 to determine which data to obtain from the data warehouses 106, 108, 110. In some examples, the metadata contains the relationships between the facts and dimensions of a star schema. These relationships allow the SQL generator 210 of the example data retriever 204 to generate the appropriate SQL-based extraction commands. In another example, the data warehouse optimizer 104 examines, via the analyzer 206, the composite statistics requested by the data analysis instructions 102. Based on the requested composite statistics, the metadata may then determine which building block statistics are needed to calculate them.
The composite statistics, on the other hand, may use some or all of the building block statistics. Such composite statistics may be computed by the analyzer 206 of the illustrated example during one or more processing stages subsequent to the building block stage. For example, the composite statistics may include determining the percentage of women aged 40-60 who watch television on weekdays and have incomes of $70,000 or above. In this example, the composite statistic is based on the following building block statistics: (1) women between the ages of 40 and 60, and (2) women with incomes of $70,000 or more. In other examples, the composite statistics may include determining the relative percentage ratings of football versus Olympic games watched by adult males during a given weekend. Such composite statistics may be based on one or more building block statistics.
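A composite (secondary) statistic of this kind can be sketched as a calculation over previously computed building-block counts. The key names and counts below are invented for illustration, loosely following the women-aged-40-60 example in the text:

```python
# Hypothetical building-block results computed in an earlier stage.
building_blocks = {
    "women_40_60_weekday_viewers": 1_200,
    "women_40_60_weekday_viewers_income_70k_plus": 300,
}

def composite_percentage(blocks, part_key, whole_key):
    """Express one building-block count as a percentage of another."""
    return 100.0 * blocks[part_key] / blocks[whole_key]

pct = composite_percentage(
    building_blocks,
    "women_40_60_weekday_viewers_income_70k_plus",
    "women_40_60_weekday_viewers",
)
print(f"{pct:.1f}%")   # 25.0%
```

Because the composite stage only combines stored building-block results, it avoids re-querying the warehouse, which is the performance benefit the text describes.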
The report generator 208 of the illustrated example provides processed data results to the client using report format parameters and/or default parameters specified in the data analysis instructions 102. The report may be provided in one or more formats, including but not limited to: tab delimited files, spreadsheets, charts, histograms, pie charts, and/or trend analysis curves. The report generator 208 of the illustrated example may additionally or alternatively invoke a web server of the communication interface 202 to format such report information for viewing in a web browser.
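Emitting one of the named formats, a tab-delimited file, might look like the following sketch; the result rows and column names are invented sample data:

```python
import csv
import io

# Hypothetical processed results to be reported.
rows = [
    {"state": "NY", "average_rating": 7.2},
    {"state": "CA", "average_rating": 6.8},
]

def tab_delimited_report(rows):
    """Render result rows as a tab-delimited report with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=rows[0].keys(), delimiter="\t")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()

report_text = tab_delimited_report(rows)
print(report_text.splitlines()[0])   # state	average_rating
```

The same rows could instead be handed to a charting or spreadsheet library, per the format parameter in the instructions 102.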
Flow diagrams of example machine readable instructions that may be executed to implement the example data warehouse optimizer 104 of FIGS. 1 and/or 2 are shown in FIGS. 3-5. In these examples, the machine readable instructions represented by the flow diagrams may comprise one or more programs for execution by: (a) a processor, such as the processor 1012 shown in FIG. 10, (b) a controller, and/or (c) any other suitable device. The one or more programs may be embodied in software stored on a tangible medium such as the memory 116 (see FIG. 1), but persons of ordinary skill in the art will readily appreciate that the one or more programs and/or portions thereof could alternatively be executed by a device other than the processor 1012 and/or embodied in firmware or dedicated hardware (e.g., implemented with Application Specific Integrated Circuits (ASICs), Programmable Logic Devices (PLDs), Field Programmable Logic Devices (FPLDs), discrete logic, etc.). For example, the processor 1012, the I/O controller, the memory controller 1020, and/or the network interface 1030 could be implemented in any combination of software, hardware, and/or firmware. Also, some or all of the blocks represented by the flowcharts of FIGS. 3-5 may be implemented manually. Further, although the example machine readable instructions are described with reference to the flowcharts of FIGS. 3-5, persons of ordinary skill in the art will readily appreciate that many other techniques for implementing the example methods and apparatus described herein may alternatively be used. For example, with reference to the flow diagrams of FIGS. 3-5, the order of execution of the blocks may be changed, and/or some of the blocks described may be changed, eliminated, combined, and/or subdivided into multiple blocks.
Referring to FIG. 3, the communication interface 202 of the data warehouse optimizer 104 may receive the analysis instructions 102 in the form of, for example, a text file, a binary file, an XML file, or another instruction format (block 302). If the analysis instructions 102 require data from one or more data warehouses (block 304), the SQL generator 210 and the data retriever 204 of the data warehouse optimizer 104 obtain the relevant data from the warehouses (block 306), as discussed below with respect to FIG. 4. If the analysis instructions 102 require only locally stored data (e.g., data previously retrieved from the warehouse by the scheduler 212) (block 304), the analyzer 206 of the data warehouse optimizer 104 processes the data stored in the local memory 116 (block 308).
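This branching logic can be summarized in a small dispatch sketch; the function is illustrative, with the flowchart block numbers noted in comments:

```python
def optimizer_step(instructions, needs_warehouse_data, fetch_due,
                   fetch, process_local, request_scheduled_fetch):
    """One pass through the FIG. 3 control flow (illustrative sketch)."""
    if instructions is not None:                 # block 302: instructions received?
        if needs_warehouse_data(instructions):   # block 304
            fetch(instructions)                  # block 306: get warehouse data
        return process_local(instructions)       # blocks 308/314: process, report
    if fetch_due():                              # block 310: scheduled time reached?
        request_scheduled_fetch()                # block 312
    return None

# Hypothetical collaborators standing in for the optimizer components.
fetched = []
result = optimizer_step(
    "ratings query",
    needs_warehouse_data=lambda i: True,
    fetch_due=lambda: False,
    fetch=fetched.append,
    process_local=lambda i: f"report for {i}",
    request_scheduled_fetch=lambda: None,
)
print(result)   # report for ratings query
```

When no instructions arrive, only the scheduler branch (blocks 310-312) runs, matching the description of the fetch-time check below.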
If no analysis instructions 102 are received by the data warehouse optimizer 104 (block 302), the scheduler 212 of the data warehouse optimizer 104 determines whether a predetermined acquisition time has been reached (block 310). As discussed above, in examples where local copies of the data warehouse are maintained (in whole or in part), the scheduler 212 may be configured to query for and obtain a copy of any new data found in the relevant portion of the warehouse each day at, for example, 2:00 a.m. (block 306). If a fetch time has not occurred (block 310), control returns to block 302. If a fetch time has occurred (block 310), a request for the corresponding relevant data is invoked (block 312).
Returning to block 308, after the data has been processed in accordance with the analysis instructions 102 (block 308, as discussed further with reference to FIG. 5), the report generator 208 generates a report for the client (block 314). Control then returns to block 302.
FIG. 4 is a flow chart illustrating an example manner of implementing blocks 306 and 312 of FIG. 3. If the provided instructions specify one or more particular data warehouses (block 402), the communication interface 202 establishes connections with those data warehouses 106, 108, 110 via a verification protocol (block 404). On the other hand, if the analysis instructions 102 do not identify one or more particular data warehouses (block 402), the process 306 may proceed with the data in the local memory 116, and/or the data retriever 204 may connect to, verify with, and/or retrieve data from one or more data warehouses known to contain data related to the instructions (block 406). As discussed above, because the data warehouses 106, 108, 110 may employ different parameter terminology than that employed by the analysis instructions 102, the data retriever 204 may resolve such differences through the metadata (block 408). The query generated by the SQL generator 210 is then sent to a remote and/or local data source (e.g., the local memory 116 and/or one or more of the warehouses 106, 108, 110) (block 409). The data returned by the data source is then stored in a data work file in the memory 116 (block 410), which may contain one or more data parameters, attributes, statistics, building block statistics, and/or format instructions.
Fig. 5 is a flow chart illustrating an example manner of implementing block 308 of fig. 3. As discussed above, the data analysis instructions 102 may include one or more statistical formulas calculated using data extracted from the data warehouses 106, 108, 110. Alternatively, the analyzer 206 may include a default formula. The analyzer 206 retrieves the work file stored at block 410 of FIG. 4 and generates any building block statistics to be processed according to the analysis instructions 102 (block 502). Such building block statistics may be stored in memory 116 for later use. The analyzer 206 then processes the building block statistics with reference to the received target parameters (block 504). The building block statistics may be used as a basis for one or more other statistical calculations involving greater complexity, additional combinations, and/or staggered permutations of target parameters of the acquired data stored in memory 116. In this way, the building block statistics may be calculated in one or more batches/groupings. Upon completion of any one of the particular building block statistics identified in the analysis instructions 102, the analyzer 206 determines whether the analysis instructions 102 include additional building block statistics instructions (block 506). If more building block statistics are to be computed, control returns to block 502. Otherwise, the analyzer 206 retrieves the composite statistics instructions from the analysis instructions 102 and/or from default instructions associated with the analyzer 206 (e.g., from an XML file) (block 508).
As discussed above, the composite statistics may be computed in part by using earlier computed building block statistics. The analyzer 206 processes such composite statistics with reference to the target parameters and/or previously calculated building block statistics (block 510). The composite statistics may be calculated in one or more batches and/or groupings, much like the building block statistics. Once any particular composite statistics instruction is completed, the analyzer determines whether the analysis instructions include additional composite statistics instructions (block 512). If more composite statistics are to be computed, process control returns to block 508.
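The building-block/composite split of blocks 502-510 can be sketched as follows: building block statistics are computed once from the work-file data, then reused by the composite pass instead of re-reading the raw data. The sample ratings rows and the audience-share formula are illustrative assumptions:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of the two-pass statistics flow: per-program totals are the
// building blocks; the share percentage is a composite derived from them.
public class StatisticsPass {
    public static void main(String[] args) {
        // Hypothetical work-file rows (block 410): program -> viewer counts.
        Map<String, long[]> workFile = new LinkedHashMap<>();
        workFile.put("World Cup", new long[]{1200, 900});
        workFile.put("PGA Tour", new long[]{300, 100});

        // Building block statistics (blocks 502/504): totals per program.
        Map<String, Long> totals = new LinkedHashMap<>();
        long grandTotal = 0;
        for (Map.Entry<String, long[]> e : workFile.entrySet()) {
            long sum = 0;
            for (long v : e.getValue()) sum += v;
            totals.put(e.getKey(), sum);
            grandTotal += sum;
        }

        // Composite statistics (blocks 508/510): share computed from the
        // already-stored building blocks, with no second pass over raw data.
        for (Map.Entry<String, Long> e : totals.entrySet()) {
            double share = 100.0 * e.getValue() / grandTotal;
            System.out.printf("%s: viewers=%d share=%.1f%%%n",
                    e.getKey(), e.getValue(), share);
        }
    }
}
```

Caching the building blocks in memory is what lets additional composite formulas be batched cheaply, as the text describes.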
Unlike traditional BI reporting methods, which perform complex statistical calculations, classifications, and/or groupings in an SQL-based manner, the data warehouse optimizer 104 of the illustrated example allows for other, highly scalable options. The data warehouses 106, 108, 110 are typically expensive monolithic servers that require costly upgrades. In view of the approach to data retrieval, grouping, and statistical computation disclosed above, cheaper PCs may be used in a scalable manner to generate client reports faster than in the prior art. One of ordinary skill in the art will appreciate that other programming languages may be used to implement all or part of the data warehouse optimizer 104. Such programming languages may include, but are not limited to: ANSI C, C++, and/or C#.
To more efficiently store and access data obtained from one or more repositories, the example system may also be provided with a data formatter 600, as shown in fig. 6. The example data formatter 600 may be implemented as part of the data retriever 204 of FIG. 2 and/or operate as a separate module of the data warehouse optimizer 104. Without limitation, the example data formatter 600 may be completely separate from the data warehouse optimizer 104 and/or may format information from the data warehouses 106, 108, 110 before that information is stored to the memory 116.
The example data formatter 600 of fig. 6 includes: a communication interface 602 (or may share the communication interface 202 of fig. 2), a scheduler 604 (which may be the scheduler 212 of fig. 2), a data warehouse extractor 606, an organizer 608, a compression engine 610, and a memory 612. The memory 612 may be external to the example data formatter 600, may access the example memory 116 shown in fig. 1, and/or may be implemented by the memory 116.
The communication interface 602 operates in a manner similar to the communication interface 202 of fig. 2. For example, the communication interface 602 enables communication between the data formatter 600 and one or more data warehouses (e.g., the warehouses 106, 108, 110 shown in fig. 1). The communication may take place over a network connection employing a communication protocol such as IEEE-802.11, Bluetooth, 900 MHz, 1.6 GHz, and/or a mobile communication protocol (e.g., CDMA, TDMA, GSM, AMPS, EDGE, etc.). Additionally or alternatively, communication to/from the example data formatter 600 may occur over a direct connection employing a Universal Serial Bus (USB) communication protocol and/or a FireWire communication protocol (e.g., IEEE-1394). Data received from the data warehouses 106, 108, 110 via the communication interface 602 may be transmitted to and/or processed by any of: the scheduler 604, the data warehouse extractor 606, the organizer 608, the compression engine 610, and/or the memory 612.
The data stored in the data warehouse may take one or more formats including an offline operations database, an offline data warehouse, a real-time data warehouse, and/or an integrated data warehouse. The data source of the data warehouse may refer to any electronic repository of information and may include host databases (e.g., IBM DB2, VSAM, ISAM, Adabas, etc.), client/server databases (e.g., SQL Server, etc.), and/or PC databases. Data warehouses typically store data that is grouped together by subject areas (e.g., consumers, products, finances, etc.) that may reflect the general purpose of the data. Such aggregation may include the use of a dimensional approach, in which information is stored as facts that may be numeric or textual data for transactions and/or events. Additionally or alternatively, a dimension (e.g., an architectural element of a database/warehouse structure) may contain reference information that allows transactions to be classified in different ways. For example, a sales transaction may be broken down into facts that include the quantity of products ordered and the price paid. In addition, the dimensions may include other factors such as the date of sale, consumer information, product details, the geographic location of the sale, and the salesperson who made the sale.
While the data warehouse optimizer 104 and the data formatter 600 can work with any database architecture, for purposes of illustration, a star schema will be discussed below. The star schema includes a single fact table with a composite primary key (e.g., Florida ratings 6-20-06, as shown in FIG. 7). The fact table contains quantitative or factual data about topics of interest to, for example, a company and/or other organization. Each different key of the fact table (e.g., the program ID and/or the household of fig. 7) has a corresponding dimension table, which may include additional fields for the facts and/or other attributes of the embodiment.
The communication interface 602 may be invoked by the scheduler 604 on a periodic, aperiodic, and/or scheduled basis to connect to one or more data warehouses. Many data warehouses contain sensitive and/or private data. Thus, such a repository may require validation prior to accessing and/or extracting the data contained therein. The verification between the data formatter 600 and the data warehouses 106, 108, 110 may include, but is not limited to: secure Socket Layer (SSL), digital certificates, password protection, encryption, and/or public key cryptography.
Once the data formatter 600 is connected to one or more particular data warehouses 106, 108, 110, the data warehouse extractor 606 constructs one or more query commands, such as SQL-based extract commands, to obtain data from the data warehouses. The extract commands from the data warehouse extractor 606 generally operate in a substantially indiscriminate manner, acquiring all data for a period of interest (e.g., a particular day, a range of days, a particular week, and/or any other time range). Without being limited thereto, the data warehouse extractor 606 can build the extract commands in a more discriminating manner, which may require SQL-based extract commands of greater complexity. For example, the data warehouse extractor 606 may review and/or parse all or a portion of the data analysis instructions 102 to obtain the particular parameters that should be extracted from the data warehouses 106, 108, 110. The data extracted from the data warehouse may be saved in the memory 612 for later organization, and/or the organization may occur in real time during the data extraction process.
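The bulk, period-based extract command described above might be assembled as in this sketch; the table and column names are hypothetical, and a real implementation would favor parameterized queries over string concatenation:

```java
import java.time.LocalDate;

// Sketch of an indiscriminate period-of-interest extract command: pull every
// fact row for an inclusive date range rather than filtering by parameter.
// "florida_ratings" and "record_date" are assumed names, not from the patent.
public class ExtractCommandBuilder {
    static String extractAll(String factTable, LocalDate from, LocalDate to) {
        // For illustration only; production code should bind dates as
        // parameters (e.g., via JDBC PreparedStatement) instead.
        return "SELECT * FROM " + factTable
             + " WHERE record_date BETWEEN '" + from + "' AND '" + to + "'";
    }

    public static void main(String[] args) {
        System.out.println(extractAll("florida_ratings",
                LocalDate.of(2006, 6, 20), LocalDate.of(2006, 6, 26)));
    }
}
```

A more discriminating extractor would append WHERE clauses derived from parsing the analysis instructions 102.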
Based on the structure (e.g., domain-specific details) of the data warehouse 106, 108, 110 from which the data was obtained, the organizer 608 creates a hierarchical directory file path structure. For example, in a star-schema configuration of the data warehouses 106, 108, 110, each dimension is modeled as a directory. As described above, a dimension may be one or more structural framework elements of a database and/or warehouse, and each dimension may have one or more data columns. For example, a product dimension may include columns relating to product identification, product cost, product size, and the like. Each dimension directory may contain a separate directory for each column of the dimension. In general, the organizer 608 may create a first hierarchical directory file path associated with a top-most (e.g., broadest, most generalized) extraction from the data warehouses 106, 108, 110. Likewise, the organizer 608 may create a second hierarchical directory file path associated with one or more extractions from the data warehouses 106, 108, 110 that are narrower than the extraction of the first hierarchical directory file path. The example organizer 608 may create any number of additional hierarchical directory file paths to accommodate one or more data warehouses.
Beneath each column of a given dimension is a first binary file containing a mapping of dimension keys to identifier (ID) data (e.g., an ID map file). Each key (e.g., one of the fact table keys that yields a dimension table) is a primary key of the dimension, and the ID is a number representing the value of that column. Each column also includes a second binary file containing a mapping of ID data to values (e.g., a value map file). The use of both files allows the actual value of a column for a given key to be derived (e.g., by queries based on the data analysis instructions 102) by walking the mapping-file hierarchy sequentially from top to bottom. For example, if the fact table includes a key named "household" and the household dimension table includes a key named "income", the associated value for a query may be determined by following the logical hierarchical directory path "DIM/household/income" to retrieve the corresponding value (e.g., "value11.bin" in fig. 7). Traversal of one or more hierarchical directory paths as described herein also reduces and/or eliminates the need for SQL commands when accessing data. As a result, data access can be simplified by using directory-browsing tools such as web browsers, kiosks, and/or directory trees.
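A minimal sketch of the two-file lookup described above, with in-memory maps standing in for the key-to-ID and ID-to-value binary files; the key name "household-17" and the income figure are illustrative assumptions:

```java
import java.util.Map;

// Sketch of the directory-walk lookup: one key->ID map file and one
// ID->value map file per dimension column, traversed in place of SQL.
public class DirectoryLookup {
    public static void main(String[] args) {
        // Stand-in for DIM/household/income ID map: dimension key -> ID.
        Map<String, Integer> keyToId = Map.of("household-17", 4);
        // Stand-in for DIM/household/income value map: ID -> column value.
        Map<Integer, String> idToValue = Map.of(4, "55000");

        // Resolve a fact-table key by walking DIM -> household -> income,
        // consulting one mapping file per level, top to bottom.
        Integer id = keyToId.get("household-17");
        String income = idToValue.get(id);
        System.out.println("income for household-17 = " + income);
    }
}
```

Because each level only needs a map lookup, a browser-style directory tour can answer the query with no SQL engine involved.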
Once the hierarchical directory structure is completed by the organizer 608, the compression engine 610 may compress all of the binary files to conserve storage space. Compression techniques may include, but are not limited to: shrinking, reducing, imploding, and/or tokenizing methods. The ZIP file format may be employed to take advantage of its open format and wide availability. The compression engine 610 may also compress the entire hierarchical directory into a single ZIP file with a date-keyed filename, such as the Julian date code illustrated in fig. 7.
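The single-archive compression step might look like the following sketch using the standard java.util.zip API; the directory layout and the Julian-style "2006177.zip" name are illustrative assumptions:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;
import java.util.zip.ZipEntry;
import java.util.zip.ZipOutputStream;

// Sketch of the compression engine: write every binary file under the
// organizer's root into one date-keyed ZIP, preserving the hierarchy
// as entry names so the directory walk still works after extraction.
public class DirectoryZipper {
    static void zipDirectory(Path root, Path zipFile) throws IOException {
        try (ZipOutputStream zos = new ZipOutputStream(Files.newOutputStream(zipFile));
             Stream<Path> walk = Files.walk(root)) {
            List<Path> files = walk.filter(Files::isRegularFile).collect(Collectors.toList());
            for (Path p : files) {
                zos.putNextEntry(new ZipEntry(root.relativize(p).toString()));
                zos.write(Files.readAllBytes(p));
                zos.closeEntry();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Build a tiny stand-in for the organizer's output tree.
        Path root = Files.createTempDirectory("NMR");
        Files.createDirectories(root.resolve("DIM/household/income"));
        Files.write(root.resolve("DIM/household/income/value.bin"), new byte[]{1, 2});

        Path zip = root.resolveSibling("2006177.zip"); // assumed date-keyed name
        zipDirectory(root, zip);
        System.out.println("wrote " + zip.getFileName() + ", exists=" + Files.exists(zip));
    }
}
```

java.util.zip uses the DEFLATE method rather than the older shrink/reduce/implode methods, but the single-archive, date-keyed result is the same.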
FIG. 7 illustrates an example star schema 702 and an example hierarchical directory path 704 created by the data formatter 600. The example star schema 702 includes a fact table "Florida ratings 6/20/06" 706 with keys (e.g., primary keys, each associated with a dimension) labeled "program ID" 708 and "household" 710. The fact table includes "viewer" 712 as data. In addition, the example star schema 702 includes a corresponding dimension table for each key 708, 710 of the fact table 706. In particular, the example star schema 702 includes a "program ID" dimension table 714 and a "household" dimension table 716. The "program ID" dimension table 714 includes additional dimension keys: "world cup" 718, "PGA circuit match" 720, "our life" 722, and "spongy" 724. Similarly, the "household" dimension table 716 includes dimension keys: "adult" 726, "young" 728, "child" 730, "TV number" 732, and "income" 734.
The example data formatter 600 creates the hierarchical directory path 704 in a manner that reflects the data warehouse structure, which in the illustrated example of fig. 7 is a star schema. In particular, the organizer 608 of the data formatter 600 creates a fact table path 736, a "program ID" dimension table path 738, and a "household" dimension table path 740. As described above, each corresponding directory includes an associated binary file (column 742) containing mapping and/or data information. The example hierarchical directory path 704 also includes separate directory tag paths 744 for the "program ID" and "household" dimension tables 714, 716. For example, once traversal reaches one of the top-level dimension table paths (e.g., 738 or 740) of the directory path 704, the corresponding binary file (column 742) may contain mapping information identifying additional paths that may be traversed. Thus, traversal to "NMR\DIM\program ID", associated with a binary file (e.g., a dimension map file) "value1.bin", provides mapping information allowing traversal to one or more binary files of the directory tag paths 744. In general, a directory traversal may include any number of nested directory paths, each accessed sequentially to provide additional mapping information (e.g., one or more traversal options) and/or data (e.g., ratings information). In the illustrated example of FIG. 7, the first hierarchical directory file path includes the fact table path 736, the second hierarchical directory file path includes the dimension table path "NMR\DIM\program ID" 738 or "NMR\DIM\household" 740, and the third hierarchical directory file path may include any of the directory tag paths 744. When the organizer 608 completes the hierarchical directory path 704, the compression engine may save the directory as a ZIP file 746.
FIG. 8 is a flow diagram illustrating an example process 800 of retrieving and formatting data from a data warehouse. As described above, the scheduler 604 may be configured to obtain data from the data warehouses 106, 108, 110 in a periodic and/or aperiodic manner (block 802). Additionally or alternatively, the acquisition of data may be invoked manually, such as by a client, a user, and/or a manager of a data acquisition and report generation service (block 804). Prior to obtaining data from one or more of the data warehouses 106, 108, 110, the communication interface 602 connects to the target data warehouse 106, 108, 110 and verifies the connection as a security measure to maintain data integrity (block 806).
The data warehouse extractor 606 of the illustrated example constructs a query command, such as an SQL-based command, to retrieve data from the target data warehouse 106, 108, 110 (block 808). For example, the data warehouse extractor 606 may construct a simple SQL-based data extraction command with minimal constraints, such as a request for all data for a particular hour, range of hours, day, range of days, week, etc. The query command is then executed and the data returned from the data warehouse is stored to the memory 612 (block 810).
The organizer 608 arranges the received data into one or more hierarchical directory paths 704 that reflect the domain-specific details of the originating data warehouse 106, 108, 110 (block 812). As noted above, while the illustrated example employs a star schema format, the systems and methods described herein are not so limited and may be adapted to any database schema. Once the hierarchical directory path organization is complete (block 812), the compression engine may compress the directory into, for example, a ZIP file (block 814). Control then returns to block 802.
FIG. 9 illustrates another example data warehouse optimizer 904. As described above, the example data warehouse optimizer 104 of FIGS. 1 and 2 improves report generation time by a factor of ten as compared to an SQL database engine, in part by relieving the data warehouse of tasks associated with statistical calculations and/or by performing such report generation processing in Java in place of SQL-based commands. Additional improvements in report generation time may be achieved by adding the data formatter 600. Rather than invoking SQL-based query commands against data warehouse data, the example data warehouse optimizer 904 of fig. 9 may use the example hierarchical directory path 704 to reduce report generation time by a factor of approximately one hundred.
The example data warehouse optimizer 904 of FIG. 9 is essentially the same as the example data warehouse optimizer 104 of FIG. 2, except for the addition of the data formatter 600 in the data retriever 204. Like components are identified in fig. 9 with the same reference numerals as in fig. 2 and 6 and will not be discussed further.
Fig. 10 is a block diagram of an example processor system 1010 that may be used to execute the example machine readable instructions of figs. 3-5 and 8 to implement the example systems and/or methods described herein. As shown in fig. 10, the processor system 1010 includes a processor 1012 connected to an interconnection bus 1014. The processor 1012 includes a register set or register space 1016, which is depicted in fig. 10 as being entirely on-chip but which could alternatively be located entirely or partially off-chip and directly coupled to the processor 1012 through dedicated electrical connections and/or through the interconnection bus 1014. The processor 1012 may be any suitable processor, processing unit, or microprocessor. Although not shown in FIG. 10, the system 1010 may be a multi-processor system and, thus, may include one or more additional processors that are identical or similar to the processor 1012 and that are communicatively coupled to the interconnection bus 1014.
The processor 1012 of fig. 10 is coupled to a chipset 1018, the chipset 1018 including a memory controller 1020 and an input/output (I/O) controller 1022. As is well known, a chipset typically provides I/O and memory management functions as well as a plurality of general purpose and/or special purpose registers, timers, etc. that are accessible or used by one or more processors coupled to the chipset 1018. Memory controller 1020 performs functions that enable processor 1012 (or processors if multiple processors are present) to access a system memory 1024 and a mass storage memory 1025.
The system memory 1024 may include any desired type of volatile and/or nonvolatile memory such as Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), flash memory, Read Only Memory (ROM), and the like. Mass storage memory 1025 may include any desired type of mass storage device, including hard disk drives, optical drives, tape storage devices, and the like.
The I/O controller 1022 performs functions that enable the processor 1012 to communicate with peripheral input/output (I/O) devices 1026 and 1028 and a network interface 1030 via an I/O bus 1032. The I/O devices 1026 and 1028 may be any desired type of I/O device, such as a keyboard, video display or monitor, mouse, or the like. Network interface 1030 may be a device that enables processor system 1010 to communicate with another processor system, such as an ethernet device, an Asynchronous Transfer Mode (ATM) device, an 802.11 device, a Digital Subscriber Line (DSL) modem, a cable modem, a cellular modem, and so forth.
Although the memory controller 1020 and the I/O controller 1022 are depicted in fig. 10 as separate functional blocks within the chipset 1018, the functions performed by these blocks may be integrated within a single semiconductor circuit or may be implemented using two or more separate integrated circuits.
Although certain methods, apparatus, systems, and articles of manufacture have been described herein, the scope of coverage of this patent is not limited thereto. On the contrary, this patent covers all methods, apparatus, systems, and articles of manufacture fairly falling within the scope of the appended claims either literally or under the doctrine of equivalents.
Claims (24)
1. A method of formatting data received from a data warehouse, the method comprising the steps of:
receiving an analysis instruction;
retrieving data from the data warehouse based on the analysis instructions;
associating a mapping file with a fact table, the mapping file stored in a first hierarchical directory file path, the fact table including primary keys associated with a dimension table, the dimension table associated with a dimension mapping file stored in a second hierarchical directory file path, the dimension table including dimension keys associated with data files stored in a third hierarchical directory file path; and
initiating a data query by sequentially traversing from the first hierarchical directory file path to the second hierarchical directory file path to the third hierarchical directory file path.
2. The method as defined in claim 1, further comprising compressing the first hierarchical directory file path, the second hierarchical directory file path, and the third hierarchical directory file path.
3. A method as defined in claim 1, further comprising locating the second hierarchical directory file path based on the mapping file stored in the first hierarchical directory file path.
4. A method as defined in claim 3, further comprising locating the third hierarchical directory file path based on a mapping file stored in the second hierarchical directory file path.
5. The method as defined in claim 4, wherein the third hierarchical directory file path contains query data associated with the first hierarchical directory file path, the second hierarchical directory file path, and the third hierarchical directory file path.
6. A method as defined in claim 1, further comprising calculating a primary statistic based on the retrieved data.
7. The method as defined in claim 6, wherein the primary statistic includes at least one of: the total number of entities performing an action, the average number of entities performing the action at a location, or the total number of entities performing the action over a period of time.
8. The method as defined in claim 7, wherein the total number of entities performing an action comprises a total number of viewers viewing a media program.
9. The method as defined in claim 6, further comprising calculating a secondary statistic based at least on the primary statistic.
10. The method as defined in claim 6, further comprising calculating a secondary statistic based at least on the primary statistic and the retrieved data.
11. A method as defined in claim 1, wherein retrieving data from the data warehouse comprises processing the analysis instructions during at least one of a predetermined time or a periodic interval.
12. The method as defined in claim 1, wherein the analysis instructions include instructions to retrieve at least one of: audience member media viewing data, broadcast program rating data, or demographic rating data.
13. An apparatus for formatting data received from a data warehouse, the apparatus comprising:
a communication interface that receives analysis instructions and is communicatively coupled to at least one data repository;
a data retriever to retrieve data from the data repository, the data retriever including a database query engine to generate a query request;
a memory storing the retrieved data;
a data formatter to create a hierarchical directory file path structure of the data stored in the memory, the data formatter to create a first hierarchical directory file path associated with a mapped file and a second hierarchical directory file path associated with at least one of a second mapped file or a data file; and
an analyzer to sequentially traverse the first and second hierarchical directory file paths in response to a query to identify corresponding data among the retrieved data stored in the memory.
14. An apparatus as defined in claim 13, wherein the first hierarchical directory file path comprises a fact table path associated with the mapping file.
15. An apparatus as defined in claim 14, wherein the second hierarchical directory file path comprises a dimension table path associated with at least one of the second map file or the data file.
16. An apparatus as defined in claim 13, further comprising a compression engine to compress the first hierarchical directory file path and the second hierarchical directory file path and to further compress at least one of the map file or the data file associated with the first hierarchical directory file path and the second hierarchical directory file path.
17. An apparatus for formatting data received from a data warehouse, the apparatus comprising:
means for receiving an analysis instruction;
means for retrieving data from a data repository based on the analysis instructions;
means for associating a mapping file with a fact table, the mapping file stored in a first hierarchical directory file path;
means for associating a primary key of the fact table with a dimension table, the dimension table associated with a dimension map file stored in a second hierarchical directory file path;
means for associating a dimension key of the dimension table with a data file stored in a third hierarchical directory file path; and
means for initiating a data query by sequentially traversing from the first hierarchical directory file path to the second hierarchical directory file path to the third hierarchical directory file path.
18. An apparatus as defined in claim 17, wherein the apparatus further comprises means for compressing the first hierarchical directory file path, the second hierarchical directory file path, and the third hierarchical directory file path.
19. An apparatus as defined in claim 17, wherein the apparatus further comprises means for locating the second hierarchical directory file path based on the mapping file stored in the first hierarchical directory file path.
20. An apparatus as defined in claim 19, wherein the apparatus further comprises means for locating the third hierarchical directory file path based on a mapping file stored in the second hierarchical directory file path.
21. An apparatus as defined in claim 17, wherein the apparatus further comprises means for calculating a primary statistic based on the retrieved data.
22. An apparatus as defined in claim 21, wherein the apparatus further comprises means for calculating a secondary statistic based at least on the primary statistic.
23. An apparatus as defined in claim 21, wherein the apparatus further comprises means for calculating a secondary statistic based at least on the primary statistic and the retrieved data.
24. The apparatus as defined in claim 17, wherein the apparatus further comprises means for processing the analysis instructions during at least one of a predetermined time or a periodic interval.
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US81642306P | 2006-06-26 | 2006-06-26 | |
| US60/816,423 | 2006-06-26 | ||
| PCT/US2007/014834 WO2008002578A2 (en) | 2006-06-26 | 2007-06-26 | Methods and apparatus for improving data warehouse performance |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1132815A1 HK1132815A1 (en) | 2010-03-05 |
| HK1132815B true HK1132815B (en) | 2013-01-25 |