EP3391192A1 - Broadening field specialization - Google Patents
Broadening field specialization
Info
- Publication number
- EP3391192A1 (application EP16876594.9A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- cache
- code
- application
- bee
- spp
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformation of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
- G06F8/443—Optimisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F7/00—Methods or arrangements for processing data by operating upon the order or content of the data handled
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2453—Query optimisation
- G06F16/24534—Query rewriting; Transformation
- G06F16/24549—Run-time optimisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/44—Arrangements for executing specific programs
- G06F9/445—Program loading or initiating
- G06F9/44521—Dynamic linking or loading; Link editing at or after load time, e.g. Java class loading
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
Definitions
- the present disclosure is generally related to field specialization broadening to multiple application domains, and more particularly is related to systems and methods for broadening field specialization to extend particular values of the invariants beyond the source code of the application to the data sources themselves.
- a database management system is a collection of software programs that manage the storage and access of data.
- DBMSes have been adopted across a wide range of application domains.
- DBMSes have been designed and engineered based on a few data models that are generally applicable to those domains.
- the relational data model is the one most prevalently adopted by commercial and open-source DBMSes. A significant amount of effort has been devoted to efficiently support this data model.
- relational database management systems are themselves general, in that they can handle whatever schema the user specifies and whatever query or modification is presented to them. Relational operators work on essentially any relation and must contend with predicates specified on any attribute of the underlying relations. Through such innovations as effective indexing structures, innovative concurrency control mechanisms, and sophisticated query optimization strategies, the relational DBMSes available today are very efficient. Such generality and efficiency has enabled their proliferation and use in many domains.
- Field specialization technology has been developed to automatically identify invariants and effect code specialization based upon invariants. Field specialization is the process of inserting spiffs into DBMS code so that the DBMS can specialize itself at runtime to exploit runtime invariants.
- a spiff which stands for specializer in the field, is code that dynamically creates specialized code at DBMS runtime. The specialized code is both smaller and faster than the original unspecialized code.
- Field specialization gets its name from the fact that the code is specialized "in the field," i.e., after the DBMS has been deployed and is running at the end user's site.
- a spiff uses the actual value of a runtime invariant—which can only be known at runtime— to dynamically produce code that is specialized to that particular value of the runtime invariant.
- Embodiments of the invention relate to field specialization broadening to multiple application domains, and more particularly is related to systems and methods for broadening field specialization to extend particular values of the invariants beyond the source code of the application to the data sources themselves.
- an ecosystem specification is disclosed to enable field specialization broadening. This specification states which (potentially multiple) applications are involved, what (potentially multiple) input data sources are read by one or more applications, what intermediate and final data products are produced by those applications, and what services are invoked by the applications, thereby providing valuable information on how intermediate results are communicated among the various applications.
- a static and/or a dynamic analysis determines that an invariant within a specialization opportunity arose even earlier than an assignment to a C variable within the application; rather, it arose from a data component at a particular location within an input data file, such as oil field data or a config parameter utilized by oil-field simulation software. Once such a value flow is determined, spiffs are created in the appropriate applications and, when given a particular oil field data file, the spiffs are executed to produce a speccode of a specialization opportunity within the oil field simulator application.
- This speccode could be associated with the particular oil field data file, to be loaded dynamically into the oil field simulator application (or passed to the operating system or stored in the IBM cluster) when that application is run. In this way, all of the applications involved in a computation can participate and be optimized, via a broadened conception of field specialization.
- the method of field specialization broadening comprises a sequence of steps: (i) deciding what (extended) invariants are there, (ii) deciding which of these extended invariants should be specialized upon, (iii) deciding what code sequence within a specialization opportunity should actually be specialized, (iv) deciding when to perform the specialization (and thus, within which application), (v) deciding where to store the generated speccode, and finally, (vi) deciding how to transfer the speccode to the site (application) where it will be invoked.
- the method of field specialization broadening comprises extending step (i) to include invariants that flow from external data or external applications, step (iv) to include spiffs located in an application separate from the application containing the source of the invariant that is being utilized by the spiff, or even autonomous from any application, and steps (v) and (vi) to include speccode stored separately from the application being specialized.
- a method is disclosed to estimate cache pressure given a specific workload and to solve the problem of instruction cache misses in the presence of code added at run-time, such as in the context of micro-specialization, when a specialized version of an often-executed portion of the DBMS can speed query evaluation.
- the algorithms include a MaxResidency algorithm, which uses information from dynamic analysis of provided workloads prior to DBMS compilation as well as the specific structure of the query evaluation plan to place runtime code wisely, thus retaining most of the possible run-time improvement of that code. It is demonstrated that the algorithm minimizes I-cache pressure for workloads that utilize as many as 1000 dynamically-linked functions. Via this algorithm, aggressive micro-specialization of a large number of functions invoked during query evaluation can effectively mitigate an increase in I-cache pressure and thus also L2-cache pressure.
- FIG. 1 is a block illustration of a single application field specialization process.
- FIG. 2 is a block diagram illustrating the spiff toolset, in accordance with an exemplary embodiment provided by this disclosure.
- FIG. 3 is a block illustration of a field specialization broadening process involving three extensions in accordance with an embodiment of the present invention.
- FIG. 4 is an illustration of a flow process of broadened field specialization with an embodiment of the present invention.
- FIG. 5 is an illustration of an alternative flow process of broadened field specialization with an embodiment of the present invention.
- FIG. 6 is an illustration of field specialization as an elaboration of a paradigm of computer science with an embodiment of the present invention.
- FIG. 7 is an exemplary Cache Hierarchy with an embodiment of the present invention.
- FIG. 8 is an illustration of percentage increase of L1 Misses of Query 22 with Various Bee Placements with an embodiment of the present invention.
- FIG. 9 is an illustration of Cumulative Cache-Line Reference histogram with an embodiment of the present invention.
- FIG. 10 is a block diagram illustrating workflow of profile-agnostic bee placement algorithms with an embodiment of the present invention.
- FIG. 11 is a block diagram illustrating workflow of SPP-guided bee placement algorithms with an embodiment of the present invention.
- FIG. 12 is a schematic diagram showing stages of a query plan tree with an embodiment of the present invention.
- FIG. 13 is an exemplary flow diagram for the evaluation of cache pressure when introducing code into a DBMS with an embodiment of the present invention.
- FIG. 14 is an illustration showing impact of increasing number of bees on cache miss rate percentage with an embodiment of the present invention.
- the disclosure also can be practiced in distributed computing environments, where tasks or modules are performed by remote processing devices that are linked through a communications network. Moreover, the disclosure can be practiced in Internet-based or cloud computing environments, where shared resources, software and information may be provided to computers and other devices on demand. In a distributed computing environment, program modules or subroutines may be located in both local and remote memory storage devices. Aspects of the disclosure described below may be stored or distributed on computer-readable media, including magnetic and optically readable and removable computer disks, fixed magnetic disks, floppy disk drive, optical disk drive, magneto-optical disk drive, magnetic tape, hard-disk drive (HDD), solid state drive (SSD), compact flash or nonvolatile memory, as well as distributed electronically over networks including the cloud. Data structures and transmissions of data particular to aspects of the disclosure are also encompassed within the scope of the disclosure.
- a "spiff,” which stands for specializer in the field, is code that dynamically creates specialized code at DBMS runtime.
- Field specialization is the process of inserting spiffs into DBMS code so that the DBMS can specialize itself by exploiting runtime invariants.
- the specialized code (which may be referred to herein as "speccode”) is both smaller and faster than the original unspecialized code.
- Field specialization gets its name from the fact that the speccode is generated and invoked "in the field,” i.e., after the DBMS has been deployed and is running at the end user's site.
- a spiff uses the actual value of a runtime invariant— which is obtained at runtime— to dynamically produce code that is specialized to that particular value of the runtime invariant.
- micro-specialization is equivalent to the term "field specialization” as used herein;
- bee is equivalent to the term "spiff" as used herein;
- specialized code is equivalent to "specialized code” as used herein, which is the result of a spiff; and
- HRE (hive runtime environment) is equivalent to the term "spiff runtime environment" as used herein.
- FIG. 1 is a block illustration of a single application field specialization process.
- the exemplary single application field specialization process 100 is related to an oil field over time.
- the single-application field specialization process has been disclosed and included in co-pending patent application with application number 62/142,325 (Dataware 15.01-P). Both the initial field specialization and the broader one discussed here apply to any application domain that processes the same data multiple times.
- a simulator 130 receives inputs from a field dependent source data 120 and a user-defined parameter 110.
- the field dependent source data 120 (“oil field data”) specifies a structure of an oil field, a three-dimensional assemblage of rock structures, some of which hold various mixtures of oil, gas, water at various pressures and some of which are impermeable.
- the user-defined parameter 110 (also referred to as "config params" or workload) specifies wells to be drilled at various times and fluids to be extracted at various rates over time.
- the simulator 130 then computes how the fluids flow throughout the oil field over simulated time of perhaps many years, producing a "simulation result”.
- the simulations may be implemented many times, each time with a different user-defined parameter 110 and with a single simulation result 140 as an output.
- the simulator may pass the simulation result 140 to a Linux operating system 150, which then passes the data to a computer cluster (such as an IBM cluster) 160.
- while Figure 1 is shown with the exemplary oil field related application, it is understood that the single application field specialization process may be applicable to various fields.
- the oil field simulator may be written in a high-level language such as C.
- a Spiff toolset shown in Figure 2 that analyzes the simulator can follow the flow of values through this C program, flowing from cold code into hot code as a data value is copied from one C variable, perhaps within a complex data structure, to another.
- the Toolset can then identify specialization opportunities: segments of hot code (that is, program components that are frequently executed) containing variables that are invariant at runtime.
- the values of these variables are known to have originated in cold code (i.e., written once) and are used (i.e., read many times) in branching or other statements in hot code, such that by knowing the particular value of an invariant, the identified hot code segment may be specialized to run faster, say by removing a branch statement along with the dead branch(es) from the hot code.
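- as an illustration only, the C sketch below shows the kind of transformation a spiff performs; the function and variable names are hypothetical and not taken from any particular DBMS. The generic routine branches on a runtime invariant, while the speccode hard-codes the invariant's value and drops the branch and its dead arm.

```c
/* Hypothetical hot-code routine: attr_type is a runtime invariant,
 * assigned once in cold code and tested on every call. */
long decode_attr(const char *buf, int attr_type)
{
    if (attr_type == 1)                 /* 4-byte integer column */
        return *(const int *)buf;
    else                                /* 8-byte integer column */
        return *(const long *)buf;
}

/* Speccode a spiff might emit at runtime once attr_type is known to
 * be 1: the branch and its dead arm are gone, so the routine is both
 * smaller and faster. */
long decode_attr_spec(const char *buf)
{
    return *(const int *)buf;
}
```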
- the specialization opportunities enable the creation of spiffs, which are placed in the cold code and can produce specialized code (abbreviated as speccode).
- FIG. 2 is a block diagram illustrating a spiff toolset, in accordance with an exemplary embodiment provided by this disclosure.
- the spiff tool toolset 200 receives as inputs an Application's source code (or application code) 210, one or more workloads 220, and an Ecosystem Specification 230 (referred to herein as "Ecosystem Spec") and outputs a Specialized Source Code 240.
- a graphical user interface is placed at a higher level than the Spiff Tools to be used by the application developer to request that the tools be executed in sequence and to visualize how the field specialization was achieved, by showing in an appropriate manner the intermediate and generated results. This GUI also provides actions that the developer can perform that will impact the operation of other tools.
- the spiff tool toolset 200 is constructed as a field specialization broadening module.
- the source code 210 is a DBMS source code comprised within a DBMS 212 of one application.
- the spiff tool toolset 200 provides an end-to-end solution that takes source files for a program or suite of related programs and automatically provides field- specialized versions of those source files, including the code for spiff generation, spiff compilation, spiff instantiation, spiff invocation, and spiff garbage collection.
- the spiff toolset 200 includes a number of tools, including but not limited to an invariant finder 201, an invariant checker 202, a summarizer 203, a snippet finder 204, and a spiff maker 205.
- the invariant finder 201 may couple to both the Application's source code 210 and one or more workloads 220 and perform static analysis on the abstract syntax tree (AST), outputting zero or more Invariant Intervals.
- the invariant checker 202 ensures the validity of output Invariant Intervals from the invariant finder 201.
- the summarizer 203 produces execution summaries, which provide output as a list of functions, statements, and variables along with their execution statistics. Such information indicates "hot spots" within the application that could benefit from field specialization.
- the Snippet Finder 204 first expands invariants across program executions by tracking which variables are read from files and where those values are put into files. This results in invariant intervals that span multiple executions.
- the tool also tracks when the value was first written to the file and when it is deleted.
- the Spiff maker 205 takes one or more Candidate Snippets as input and outputs the Specialized Source Code 240.
- the aforementioned tools within the spiff toolset 200 are placed within a single computer apparatus. In some embodiments, the aforementioned tools within the spiff toolset 200 are placed among distributed computers.
- the Ecosystem Specification 230 provides additional information that enables the analysis performed by the simulator 130 to be broadened considerably.
- the specification 230 states which (potentially multiple) applications are involved, what (potentially multiple) input data sources are read by one or more applications, what intermediate and final data products are produced by those applications, and what services are invoked by the applications, thereby providing valuable information on how intermediate results are communicated among the various applications.
- the Ecosystem Specification 230 may capture a bigger picture beyond the single oil field simulator 130, which only takes a single "oil field data” source and one "config params" data source as inputs to produce a single “simulation result” data product.
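- purely as a hedged illustration (the struct and the entries below are hypothetical and not a prescribed specification format), the kind of facts an ecosystem spec records for the scenario of Figure 1 could be rendered in C roughly as follows: which applications participate, which data sources they read, which data products they write, and which services they invoke.

```c
/* Hypothetical, simplified rendering of ecosystem spec entries. */
typedef struct {
    const char *application;  /* participating application          */
    const char *reads;        /* input data source it consumes      */
    const char *writes;       /* intermediate or final data product */
    const char *invokes;      /* service / OS facility it calls     */
} EcosystemEntry;

/* Entries loosely mirroring Figure 1: the oil field simulator reads
 * the oil field data and config params, writes the simulation result,
 * and hands it to the Linux OS, which forwards it to the cluster. */
static const EcosystemEntry ecosystem_spec[] = {
    { "oil_field_simulator", "oil_field_data",    "simulation_result", "linux_os"    },
    { "oil_field_simulator", "config_params",     "simulation_result", "linux_os"    },
    { "linux_os",            "simulation_result", "simulation_result", "ibm_cluster" },
};
```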
- the input, intermediate, and final data sources along with the multiple applications that participate in the computing process allows field specialization to be broadened such that the flow of particular values of the invariants may be extended beyond the source code of a single application.
- field specialization broadening also enables across file RW (read / write) operations and even operations via network communication, such as passing the resulting data to an operating system such as Linux and to other destinations.
- static and dynamic analysis may determine that an invariant within a specialization opportunity arose even earlier than an assignment to a C variable within the single application. Instead, the invariant is extracted from a data component at a particular location within an input, such as the oil field data or the config parameters.
- spiffs may be created in the appropriate applications.
- the spiffs are executed to produce speccode of a specialization opportunity within the oil field simulator application.
- This speccode could be associated with the particular oil field data file, to be loaded dynamically into the oil field simulator application (or passed to the Linux OS 150 or stored in the IBM cluster 160) when that application is run. In this way, all of the applications involved in a computation can participate and be optimized, via a broadened conception of field specialization.
- At least three exemplary extensions may be implemented for field specialization broadening.
- the first extension to field specialization is cross-application value flows (where a value transfers out of one application and subsequently into another application), specified by the ecosystem spec for how data travels from one application to another.
- a value flows from one C variable to another via a data copy, implemented with low-level load and store operations.
- additional kinds of transfers can be accommodated in the invariant flow analysis. For instance, passing a value as a parameter (which can be a structure) to an operating system call, or receiving a return value from a system call, can establish data flow between an application and the OS.
- the ecosystem spec can support specification of other means of communication, such as exchanging data across devices.
- a graphics processing unit may communicate with a central processing unit (CPU) over a memory bus, with values copied via network communication channels, such as sockets.
- the ecosystem spec annotates a plurality of connecting paths with the means for such value transfers and the Spiff Toolset analysis utilizes such transfers to extend the value flows, thereby identifying additional cross-application value flows and additional specialization opportunities within each of the participating applications.
- the second extension to field specialization is an inter-application analysis.
- the static and dynamic analysis is performed by the Spiff Toolset not just on the source code of a single application, but also across the data read and written by that application. That analysis may be similarly extended to other applications or operating systems invoked by an application, such as the Linux operating system, or storing invariant values, such as within an IBM cluster, shown in Figure 1.
- a particular value within the oil field data flows into the oil field simulator application, then flows into the Linux operating system as a system call parameter or as data passed through a system call, then flows into the IBM cluster. This same value may flow back into the simulator from the Linux O/S as an operating system return value or data value returned.
- Inter-application analysis combines the per-application value flow analysis, performed by the Spiff Toolset, with analysis of data read by an application or sent from one application to another. Inter-application analysis needs to understand the semantics of the cross-application value flows and be able to combine value flows within a single application to compute value flows across applications. Furthermore, inter-application analysis is able to determine where the speccode that is associated with particular invariants can be best instantiated.
- the third extension to field specialization is invariant cross-application termination, verifying the possibility of an invariant originating in an application (or data source or product) and terminating in a specialization opportunity in a separate application.
- This enables flexibility in placing the spiff, in associating speccode with a data source or product or application, and in shipping the speccode to the destination application.
- the Spiff Toolset can determine where the speccode can be best stored and subsequently communicated to the relevant application where it will be installed.
- the Spiff Runtime Environment needs mechanisms to handle the variations enabled by the structure provided by an ecosystem spec.
- FIG. 3 is a block illustration of a field specialization broadening process involving the three extensions just described, in accordance with an embodiment of the present invention.
- the field specialization broadening process 300 comprises additionally a router 370 (a hardware device that now comes with significant computation capacity) and a cloud service 380 (including data analytics).
- the router 370 receives inputs from the Linux operation system 150 and passes the inputs to the cloud service 380.
- speccode 310 may involve invariants both from the oil field data 120 and the oil field simulator 130.
- Speccode 320 may involve invariants both from the config params 110 and the oil field simulator 130.
- Speccode 330 stored in the Linux operating system 150 may involve invariants from the simulator and oil field data.
- Speccode 340 stored in the router 370 and Speccode 350 stored in the cloud 380 may also involve invariants from the simulator and the oil field data.
- the speccode stored within the simulator may originate from the oil field data and from the simulator itself, and may be identified through basic field specialization. Other spiffs are stored with the operating system, router, and cloud, specializing code found in the indicated applications. In some embodiments, speccodes may flow from where they are stored to where they may be invoked (the application that provided the specialization candidate from which they were subsequently specialized). For example, the oil field data may store the router speccode 340. In some embodiments, speccode identifiers can reside with data or with applications and can also be included in communications with subsequent applications, indicating the relevant speccode to (later) invoke.
- a backend web server is responsible for processing input data obtained from user activities at the front-end, i.e., in web browsers that run javascript applications.
- the user inputs can vary and hence present different patterns within the data collected at the backend.
- the backend server can then produce client (javascript) speccode for different value patterns that can then be pushed to the (appropriate) browsers.
- Data center work distribution: A distributor for work packets is common in big data applications (such as Hadoop).
- the distributor is an application that can parcel out the work based on certain attributes, which can then serve as invariants to specialize the tasks running on different worker nodes.
- the spiff and the speccode it generates are in an application separate from the application in which the speccode is invoked.
- FIG. 4 is an illustration of a flow process of broadened field specialization with an embodiment of the present invention.
- in broadened field specialization there is a sequence of decisions to be made.
- the flow process of broadened field specialization 400 shown in Figure 4 comprises a plurality of steps to address this sequence of decisions.
- Step 410 is deciding what (extended) invariants are present; Step 420 is deciding which of these extended invariants should be specialized upon; Step 430 is deciding what code sequence within a specialization opportunity should actually be specialized; Step 440 is deciding when to perform the specialization (and thus, within which application); Step 450 is deciding where to store the generated speccode, and finally, Step 460 is deciding how to transfer the speccode to the site (application) where it will be invoked.
- while the steps shown in Fig. 4 are related to broadened field specialization, it is understood that the steps are also applicable to conventional field specialization, since the steps are necessary for field specialization in general.
- FIG. 5 is an illustration of an alternative flow process of broadened field specialization with an embodiment of the present invention.
- step 510 in Figure 5 is extended to invariants that flow from external data or external applications;
- step 540 in Figure 5 is extended to spiffs located in an application separate from the application containing the source of the invariant that is being utilized by the spiff, or even autonomous from any application, and
- steps 550 and 560 in Figure 5 are extended to speccode stored separately from the application being specialized.
- the flow process of broadened field specialization disclosed within this disclosure may be stored as computer executable instructions within a non-transitory computer readable medium and executable by a processor of a computing device.
- the specialization broadening process may be executed by a plurality of computers with different computers executing different process steps.
- FIG. 6 is an illustration of field specialization as an elaboration of a paradigm of computer science with an embodiment of the present invention.
- the diagram comprises four quadrants 610, 620, 630 and 640 for scenarios of data represented as data, code represented as data, data represented as code, and code represented as code, respectively.
- Postscript language: In the 1980s, the Postscript language was invented. This was code that, when executed, would create an image. Postscript is produced by a formatter, taking a document such as a Microsoft Word file, which is data, and converting it to a program: again, code represented as data, as represented in quadrant 620.
- Field specialization takes this idea further.
- Field specialization takes the values of invariants, that is, data, and uses these values to create a specialized code version of a portion of an application, such as a DBMS, which is code that can be executed.
- a relation speccode is the result of specializing DBMS code using the schema of a relation (data).
- a tuple speccode is the result of using the data values within a tuple (row of a table).
- An O/S speccode is a specialization of a snippet of an operating system based on particular data values of particular invariants within that snippet; ditto for router speccodes.
- This data-represented-as-code (as represented in quadrant 630) can be created in one application from a snippet in that application or another application, passed around between applications, and invoked by the target application when appropriate.
- the field specialization technology provides the means for identifying when such speccode is effective in increasing performance, when it should be created, which invariants it should be specialized upon, how it can be communicated across applications, and when it should be invoked.
- the fourth extension involves introducing code into a DBMS at run-time while simultaneously minimizing cache pressure.
- FIG. 7 is an exemplary cache hierarchy with an embodiment of the present invention.
- the cache hierarchy 700 shows a communication between a CPU 710 comprising CPU cache(s) and a main memory 740.
- a CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) to access data from the main memory.
- the cache is a smaller, faster memory which stores copies of the data from frequently used main memory locations.
- Most CPUs have different independent caches, including an instruction cache (I-cache) and a data cache (D-cache), where the data cache is usually organized as a hierarchy of more cache levels (L1, L2, etc.).
- as shown in Fig. 7, the CPU 710 comprises one or more cores 720, with each core having its own cache hierarchy.
- Each core 720 comprises an L1 I-cache 722 and an L1 D-cache 724, which are coupled to an L2 cache 726 inside the core.
- An L3 cache 730 couples to the L2 cache 726 of each core and also couples to the main memory 740 for data/instruction exchange.
- a variety of placement algorithms are disclosed that mitigate the I-cache pressure in the presence of run-time code generation within a DBMS. These algorithms use a combination of off-line profiling and on-line code placement that takes into account the query plan, to compute a novel kind of profile called the Slot Pressure Profile, which estimates the pressure at each cache slot.
- the MaxResidency algorithm, which achieves the best performance among these placement algorithms, first collects the profiles of operators off-line, then examines the query plan and estimates the I-cache's pressure by taking into consideration all involved operators. Finally, it determines the memory location for the generated code that will be least harmful to I-cache performance.
- SPP Slot Pressure Profile
- a hybrid profile-guided placement algorithm is disclosed that performs the expensive profiling off-line and does on-line profile computation from the off-line profiles very efficiently.
- the disclosure concerns the efficiency challenges of a confluence of three generally disparate threads: a) dealing with cache pressure, b) handling large amounts of data via a DBMS, and c) just-in-time (JIT) compilation.
- JIT just-in-time
- JIT compilation: One major research topic of JIT compilation is to improve the cache performance when instructions are generated and mapped to different cache slots.
- State-of-the-art results include: code reordering, where the virtual machine separates code and data space, rearranges methods into hot and cold portions, and does code padding for frequent caller/callee pairs; cache line reservation, where some cache lines are reserved for several particular objects to avoid the eviction of these objects; partial code compilation, where dynamic analysis of control flow is utilized to avoid optimizing large functions; code tiling, where the call graph is divided into components that each fit within one cache slot in a greedy manner; JIT co-allocation, where a compiler co-allocates neighboring procedures in the call chain rather than procedures that frequently call each other; conflict matrix coloring, where a conflict matrix records the potential conflicts for a sequence of instructions and coloring is done based on the matrix; locality grouping, where the information of the interaction graph and the sequence of instructions is combined to group and reorder methods for co-allocation; and phase prediction, where the information of reuse distance is used
- the JIT approach has made its way into databases, in two parallel guises.
- One way is micro-specialization, in which frequently-executed functions containing branch instructions on identified invariants are specialized at run- time, with the newly-generated code read in (again, at run-time) to improve performance.
- the other way is query compilation, where the DBMS query is compiled into machine code at run-time, then that machine code is loaded and executed.
- the focus of the disclosure is the former, but the placement algorithms disclosed are applicable to the latter.
- Micro-specialization speeds up a DBMS by specializing a general function, using information not known at compile time. For example, the number of columns in a specific table cannot be known at compile time, so a general function must be able to handle all possible numbers of columns. But the number of columns in the table is known and unchanged (except for ALTER TABLE commands) once the table is created. If this constant is hard-coded into a new function that replaces the old one, at least the cost of passing that one parameter is saved, along with any branches on that value. Other invariants, such as the type and size of each column and whether each column is nullable, can also be specialized. However, this can only happen after the DBMS is compiled and the table is created, because the number of possible combinations is simply untenable.
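- a minimal C sketch of this column-count example follows; the function names and tuple layout are illustrative only, not PostgreSQL's actual code. The generic routine loops over an unknown number of columns, while the run-time-generated version hard-codes the count and offsets that become known once the table is created.

```c
/* Generic routine compiled into the DBMS: ncols and offsets are only
 * known at run time, so they must be passed in and looped over. */
void extract_columns(const char *tuple, const int *offsets,
                     int ncols, long *out)
{
    for (int i = 0; i < ncols; i++)
        out[i] = *(const long *)(tuple + offsets[i]);
}

/* Specialized routine a bee could embody for a table known to have
 * exactly three 8-byte columns: the loop, the parameter passing, and
 * the per-iteration test on ncols all disappear. */
void extract_columns_3(const char *tuple, long *out)
{
    out[0] = *(const long *)(tuple + 0);
    out[1] = *(const long *)(tuple + 8);
    out[2] = *(const long *)(tuple + 16);
}
```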
- the disclosure of this invention is to address the problem of (a) cache pressure in the context of (b) a DBMS utilizing (c) JIT techniques.
- the focus of the disclosure is on micro-specialization, but the techniques also apply to some kinds of query compilation. In either case, it is desirable to retain the benefits of (i) main-memory/disk performance improvements utilized by conventional DBMSes, (ii) cache pressure reduction techniques commonly utilized by conventional compilers, (iii) any cache pressure reduction techniques utilized by modern DBMSes, and (iv) performance improvements realized by recent JIT approaches, specifically, micro-specialization, thereby achieving a DBMS that is highly efficient with regard to disk accesses, data cache pressure, and instruction cache pressure.
- each instruction executed is first fetched into the CPU cache.
- the slot in the cache (whether the first level instruction cache, the second level integrated cache, or even the third level integrated cache shared across CPUs) where an instruction is located, namely a cache line, is determined by the instruction's address in the virtual memory space.
- cache conflicts may occur, resulting in expensive cache-miss penalties (cache-line eviction and instruction fetching from slower levels in the memory hierarchy).
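- for concreteness, the address-to-slot mapping can be sketched as below, using the 32 KB, 64-byte-line, 4-way associative L1 I-cache configuration described later in this disclosure (32768 / (64 × 4) = 128 slots); the helper name is hypothetical and not part of any DBMS.

```c
#define LINE_SIZE      64                    /* bytes per cache line  */
#define ASSOCIATIVITY  4                     /* ways per slot         */
#define CACHE_SIZE     (32 * 1024)           /* 32 KB L1 I-cache      */
#define NUM_SLOTS      (CACHE_SIZE / (LINE_SIZE * ASSOCIATIVITY))  /* 128 */

/* Two instructions whose addresses yield the same value here compete
 * for the same slot and can evict one another once more than
 * ASSOCIATIVITY distinct lines map to that slot. */
static inline unsigned cache_slot(unsigned long addr)
{
    return (addr / LINE_SIZE) % NUM_SLOTS;
}
```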
- query evaluation usually involves executing a large number of instructions over many iterations. This substantial footprint of query evaluation code can potentially lead to performance degradation due to high instruction-miss penalty. While compilers such as gcc apply basic-block reordering techniques to improve code locality thus reducing instruction cache misses, the bee code does not benefit from this optimization in that bees do not exist when the DBMS is compiled.
- a 3C model [M. D. Hill and A. J. Smith. Evaluating associativity in CPU caches. IEEE Trans. Comput., 38(12):1612–1630, Dec. 1989] provides terminology for the types of cache misses in each situation:
- a capacity miss occurs when the cache simply can't fit all the accessed memory locations; such misses can be reduced with a larger cache, in which the accessed memory locations associated with a particular cache slot get spread across several slots in the larger cache.
- a conflict miss occurs when an entry that was present in the cache gets evicted by an access that would have associated the same slot even with a larger cache; such misses can be reduced by a higher cache associativity.
- a function F associates a cache slot S if the memory address of any piece of F's machine code maps to S.
- the working set of a collection of functions to be executed is defined as the collection of the distinct memory blocks touched over the execution of a sequence of instructions.
- the working set at the cache level is of particular interest.
- the cache slot working set refers to the memory blocks that are in that same working set and map to the same cache slot.
- a bee is a specialized version of a DBMS function that is created and loaded into memory at run-time. Such functions may occupy one memory block, multiple contiguous blocks, or multiple non-contiguous blocks.
- Figure 8 presents a study of the relationship between various placements (indicated by the x axis) of a single bee and the percentage increase in I-cache misses (shown by the y axis) during the evaluation of a query.
- the experiment machine is equipped with a 32 K I-cache. Given that each cache line is 64 bytes wide and the cache is four-way set associative, there are a total of 128 cache slots, each consisting of four cache lines. 16 placements are sampled uniformly across all 128 possible cache slots. In other words, the same query is evaluated 16 times, each time with a different placement of the bee executed during each query evaluation.
- FIG. 9 is an illustration of Cumulative Cache-Line Reference histogram with an embodiment of the present invention.
- Fig. 9 suggests determining, for a given query, the cache pressure at each slot, rendering it as a histogram, and placing bees in the valleys where there is little pressure.
- the attractive valleys are at 0x07–0x0f and 0x56–0x60. A clever scheme will be introduced later.
- the existing DBMS code has a small working set at this slot, so doesn't use up its cache capacity. In this case, the bee only causes compulsory misses, introducing little cache pressure.
- the existing DBMS code has a large working set at this slot, so uses up the cache capacity and experiences both capacity and conflict misses.
- instructions in the bee evict existing DBMS code and cause an increase of conflict misses.
- Such behavior is called a DBMS-bee conflict. This case is fairly predictable, and indeed algorithms will be provided that do a good job of avoiding such conflicts.
- the existing DBMS code has a medium working set at this slot, so uses up the cache capacity yet experiences no conflict misses.
- because the DBMS will revisit its existing code as well as the bee frequently during query evaluation, instructions in the bee will evict existing DBMS code from the cache and thus perhaps cause a sharp increase in conflict misses. This is the challenge to be addressed.
- the average cache miss rate of TPC-H queries with a naive placement rises to 0.912% with 200 bees, 1.66% with 400 bees, and 2.95% with 1000 bees. This is of concern when a cache miss is more than 10 times slower than a cache hit. A more sophisticated placement algorithm is required.
- Profile-agnostic algorithms may be relatively simple in terms of design but generally offer mediocre performance. They are introduced as a warm-up, and further opportunities for optimization are then explored.
- FIG. 10 shows a block diagram illustrating workflow 1000 of profile- agnostic bee placement algorithms with an embodiment of the present invention.
- the workflow comprises a slot selection phase 1005, a bee insertion phase 1010, and a query evaluation phase 1015, all of which occur while the DBMS is running.
- the slot selection phase 1005 and bee insertion phase 1010 can be done at query optimization time just before evaluating the query.
- the slot selection phase 1005 receives static information, specifically the machine code of the DBMS, which provides the address of the caller of each bee, and the bee code, which includes the name of its caller, to decide the slot to associate, then it passes that slot number to the bee insertion phase 1010, which finds an open memory address that maps to that slot, places the bee and checks if all the bees have been placed.
- the bee insertion phase 1010 feeds back the next bee to the slot selection phase 1005 to start the slot selection process again.
- the query evaluation phase 1015 actually performs the query, calling whatever spiffs were created for that query, and then sends a next query to the slot selection phase 1005 to start the slot selection process again.
- Slot Selection phase 1005: This phase yields the slot that the bee's address should map to.
- this phase involves a SlotZero Algorithm and a NextToCaller Algorithm.
- SlotZero Algorithm always returns a location that maps to cache slot zero. This algorithm is expected to perform poorly, for the following reason: both inter-bee conflicts and DBMS-bee conflicts are going to occur, as long as there are more than four bees (the assumed slot associativity).
- NextToCaller Algorithm first finds the address of the bee's caller function by examining the machine code. Then it finds the slot this address maps to. This is called the caller slot.
- this algorithm calculates the size of the bee in terms of cache lines and returns the slot that is immediately before the caller. Specifically, bees are grouped with their callers. The NextToCaller Algorithm reduces inter-bee conflicts in comparison with SlotZero, yet the DBMS-bee conflicts remain unaddressed.
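- a hedged C sketch of the NextToCaller rule follows (helper names are hypothetical; the slot arithmetic assumes the 128-slot, 64-byte-line configuration sketched earlier): the bee is sized in cache lines and assigned the slots immediately preceding its caller's first slot, so a bee and its caller are grouped.

```c
#define LINE_SIZE 64
#define NUM_SLOTS 128

static unsigned cache_slot(unsigned long addr)
{
    return (addr / LINE_SIZE) % NUM_SLOTS;
}

/* Return the starting slot for a bee so that it ends just before the
 * slot its caller begins at, wrapping around the slot space if the
 * caller sits near slot 0. */
unsigned next_to_caller_slot(unsigned long caller_addr, unsigned bee_bytes)
{
    unsigned caller_slot = cache_slot(caller_addr);
    unsigned bee_lines   = (bee_bytes + LINE_SIZE - 1) / LINE_SIZE;
    return (caller_slot + NUM_SLOTS - bee_lines) % NUM_SLOTS;
}
```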
- Slot Pressure Profile Taxonomy: A Slot Pressure Profile is an array of integers, one for each cache slot in an identified cache. It is named with the following format: "per-<granularity> <source> <cache> <what> slot pressure profile" (or SPP).
- granularity indicates at which granularity the numbers are collected.
- a binary SPP contains values that are calculated from the binary executable without running it.
- a raw SPP contains values that are output from a profiling tool during the execution of a specified workload.
- a reference SPP contains values that are computed over other SPP(s) by an algorithm.
- cache indicates which cache the numbers refer to. It can be the (level 1) I-cache, the (level 1) D-cache, or the (integrated) L2-cache.
- FIG. 11 shows a block diagram illustrating workflow 1100 of SPP-guided bee placement algorithms with an embodiment of the present invention.
- the workflow 1100 comprises six phases: an SPP generation phase 1105, an SPP processing phase 1110, a candidate ordering phase 1115, a slot selection phase 1120, a bee insertion phase 1125, and a query evaluation phase 1130.
- in the SPP Generation phase 1105, the algorithm generates at DBMS compile time one or more SPPs by assigning weights to a binary SPP with external knowledge from the programmer, collecting a raw SPP over the DBMS executing a workload, or computing a reference SPP. This phase takes time but is done statically, before the DBMS starts. The remaining phases are lightweight and occur within the DBMS as part of the query optimization phase, so that the bees are placed when query execution commences. Each algorithm creates its own reference SPP from the provided (intermediate) SPP(s) in the SPP Processing phase. Each algorithm then generates a list of candidate slots in the Candidate Ordering phase by analyzing the reference SPP from the previous phase.
- the candidate slot list contains a sequence of slot numbers that will be used to successively locate bees (a slot could be present several times in this list). (The MaxResidency algorithm emits several lists; the rest emit a single list.) The Slot Selection, Bee Insertion, and Query Evaluation phases work as before.
- the input to SPP Generation phase 1105 is the binary of the DBMS.
- the output is a single intermediate SPP.
- the SPP Generation phase 1105 involves three distinct SPP generation algorithms, each utilizing different input workloads.
- the input to this stage is the DBMS binary and the workload.
- the output is a single intermediate SPP per operator.
- the call graph for the DBMS is first obtained by executing the representative workload, which involves the operators of interest. Then a breadth-first search is performed for each query evaluation operator to identify all the functions it may call. However, an operator A sometimes reaches a function F through another operator B; in that case, F shouldn't be considered to be reachable by A.
- this algorithm generates a per-function binary I-cache function score SPP by assigning 1 to its associated slots. Adding up the SPPs for an operator yields a per-operator reference I-cache function count SPP.
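- as a rough illustration (assuming the 128-slot configuration above; the function names are hypothetical), the per-function function-score SPP simply marks with a 1 every slot that the function's machine code touches, and the per-operator function count SPP is the element-wise sum of the score SPPs of the functions reachable from that operator.

```c
#define LINE_SIZE 64
#define NUM_SLOTS 128

typedef struct { long slot[NUM_SLOTS]; } SPP;

/* Binary function-score SPP: a 1 in every slot that any byte of the
 * function's machine-code address range [start, end) maps to
 * (end > start assumed). */
void function_score_spp(unsigned long start, unsigned long end, SPP *spp)
{
    for (unsigned long line = start / LINE_SIZE;
         line <= (end - 1) / LINE_SIZE; line++)
        spp->slot[line % NUM_SLOTS] = 1;
}

/* Per-operator function count SPP: sum the score SPPs of all functions
 * reachable from the operator. */
void accumulate_spp(SPP *operator_spp, const SPP *function_spp)
{
    for (int i = 0; i < NUM_SLOTS; i++)
        operator_spp->slot[i] += function_spp->slot[i];
}
```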
- a slot with a low number of misses may still be full, such that adding a bee to that slot will evoke many misses, due to DBMS-bee conflicts and perhaps inter-bee conflicts.
- the MaxResidency algorithm makes a prediction of the size of the working set.
- let W denote the associativity of the I-cache.
- for an address accessed within a particular execution (that is, of the workload), each section will contribute a residency, and this address will be associated with a sequence of that many residencies.
- the first, the pressure threshold, is the ratio of the instruction count at a slot to the average instruction count over all slots.
- This parameter is considered to constitute pressure for the associated slot.
- a value of 0.05 is used.
- a low value ensures most cache slots are further considered.
- the second, the residency threshold, is the ratio of a residency to the total instruction count at this slot that is considered long enough to occupy the cache line.
- a value of 0.15 is used. This value should not be greater than 1/W (for example, 0.25) to be meaningful. The lower it is set, the more readily the algorithm considers a residency to be in the main loop. Note that these parameters are subjective. It will be seen that these specific values obtain good performance.
- the MaxResidency algorithm first computes an instruction threshold, to identify slots that are accessed often enough to potentially incur cache pressure. Then it checks the max residency of this slot. If one residency accounts for 15% of the total counts at this slot, this residency is considered to be in the working set and the estimated occupancy at this slot is incremented by 1.
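- a condensed C sketch of this per-slot estimate is given below; it is a restatement for illustration only (Algorithms 1 and 2 themselves are not reproduced here, and the two thresholds come from the text above), and it can be checked against the three examples that follow.

```c
#define W 4                          /* I-cache associativity          */
#define PRESSURE_THRESHOLD  0.05     /* fraction of the average count  */
#define RESIDENCY_THRESHOLD 0.15     /* fraction of this slot's count  */

/* top[0..W-1]: the W largest residencies observed at this slot;
 * total: the slot's instruction count; avg: average count per slot. */
int estimated_occupancy(const long top[W], long total, double avg)
{
    if (total < PRESSURE_THRESHOLD * avg)
        return 0;                    /* too few accesses to matter     */

    int occ = 0;
    for (int i = 0; i < W; i++)
        if ((double)top[i] / total >= RESIDENCY_THRESHOLD)
            occ++;                   /* this residency looks like main-loop code */

    /* No long residency at a busy slot: the working set is assumed to
     * exceed the associativity (cf. Example 3, where the estimate is 5). */
    return occ > 0 ? occ : W + 1;
}
```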
- the MaxResidency Algorithm is reviewed with three concrete examples. For simplicity, it is assumed that W is 4 and the cache has only two slots: 0 and 1, with a memory address mapping to a cache slot by its last bit.
- Example 1: an instruction sequence with a small working set. For this first example, consider an instruction sequence S1 consisting of ten repetitions of address 11 followed by address 21 (denoted 11, 21), both of which map to slot 1. The size of the working set of S1 at slot 1 is thus 2, because there are two distinct blocks. How does the MaxResidency algorithm infer 2? It first sees that the average instruction count is 10 (total instruction count / number of cache slots).
- the threshold value is 0.5 (pressure threshold × average instruction count; cf. line 5 in Algorithm 1), to filter out slots with few accesses.
- Slot 0 has 0 ( ⁇ 0.5) instructions so gets 0 as estimated occupancy immediately (cf. line 4 in Algorithm 2).
- Slot 1 has 20 (>0.5) instructions so should be further processed.
- the max residency of slot 1 is <10, 10, 0, 0>. 10/20 is larger than the residency threshold, so the algorithm concludes that slot 1's estimated occupancy is 2 (cf. line 8 of Algorithm 2).
- Example 2: an instruction sequence with a large working set.
- Example 3: an instruction sequence with multiple loops.
- S3 is ten repetitions of instruction addresses 1, 11, 21 and S4 is twenty repetitions of instruction addresses 31, 41, 51, 61, 71, 81.
- the algorithm will compute an average instruction count of 75 and will filter slot 0 out.
- the max residency at slot 1 is <10, 10, 10, 1>.
- the first three entries are the residencies of address 1, address 11, and address 21; all other residencies are 1.
- Even the largest, 10/150, is smaller than the residency threshold. This indicates that the top W residencies do not come from the major loop. So it concludes that slot 1's estimated occupancy is 5.
- the actual size of slot 1's working set is known to be 7 (3 + 4). This example demonstrates that MaxResidency is not misguided by the small loop.
- the result of this analysis is a per-operator reference I-cache MaxResidency SPP for each operator, stating the estimated occupancy of each slot, computed by running the DBMS on the workload at compile time.
- An SPP for the bee invocation code can be created by identifying its machine code fragments and their slots. Note that only one such SPP is computed, termed the per-workload reference I-cache bee invocation SPP.
- SPP Processing Phase 1110 receives the intermediate SPP obtained in SPP Generation phase 1105 and outputs a reference SPP (or profile).
- the SPP Processing Phase 1110 comprises one or more algorithms selected from the Binary Function Score Algorithm, the Function Count Algorithm, and the MaxResidency Algorithm. These algorithms infer the cache pressure and construct a reference SPP, given the query plan. In embodiments, each algorithm calculates its reference SPP differently.
- Function Count Algorithm takes into consideration that for each particular query, only the functions related to the plan's operators will be executed, and the number of times a function is executed is exactly that operator's cardinality.
- the algorithm constructs the reference SPP by multiplying the SPP of each operator in the plan by its input cardinality, then summing up these per-operator SPPs to get a final SPP, the per-query reference I-cache inner-loop score SPP.
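- a brief C sketch of this weighted combination follows (an illustration only; the SPP type mirrors the one sketched earlier and is re-declared here for self-containment): each plan operator's SPP is scaled by its input cardinality and the results are summed slot by slot.

```c
#define NUM_SLOTS 128
typedef struct { long slot[NUM_SLOTS]; } SPP;

/* Build the per-query reference I-cache inner-loop score SPP from the
 * per-operator SPPs of the nops operators appearing in the plan. */
void per_query_spp(const SPP op_spp[], const long cardinality[],
                   int nops, SPP *query_spp)
{
    for (int i = 0; i < NUM_SLOTS; i++) {
        query_spp->slot[i] = 0;
        for (int op = 0; op < nops; op++)
            query_spp->slot[i] += cardinality[op] * op_spp[op].slot[i];
    }
}
```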
- DBMSes use the so-called "Volcano" query evaluation model that delivers a result tuple by traversing nodes in the query plan tree, each returning a tuple for use in the parent node. Each node in the query tree is evaluated to produce a tuple, and so to determine the cache pressure for the entire query, the per-operator SPPs can be added up, for those operators appearing in the query plan, to obtain the SPP of the whole query.
- PostgreSQL utilizes the concept of a blocking operator, which calls its child operators potentially many times, thereby computing the entire result of its subquery before returning its first result. Subsequent calls will just return successive tuples, without further evaluation of its child operators.
- Each blocking operator creates a stage, a subset of the operators in the plan that all execute to completion, while the rest of the operators wait. The only exception is that the root operator of the query plan tree always creates a stage even if it is not a blocking operator. This provides an opportunity to estimate the cache pressure of each stage.
- FIG. 12 is a schematic diagram 1200 showing stages of a concrete query plan tree with an embodiment of the present invention.
- hash (1222) is emphasized as it is a blocking operator.
- the diagram has a Stage 0 (1210) and a Stage 1 (1220).
- Stage 0 (1210) is rooted at the root hash join 1212 and also comprises a sequential scan operator 1214 (applied to table B) that is coupled to the hash join 1212.
- Stage 1 ( 1220) is rooted at hash 1222 and also comprises a sequential scan operator 1224 (applied to table A) that is coupled to the hash 1222.
- the root hash join 1212 couples to the root hash 1222.
- the operator node where a bee is invoked is termed the bee operator, and the operators that are in the same stage as the bee operator are termed the active operators.
- there could well be stages below this blocking operator; these stages will be executed in totality during the first call of the blocking operator, but will not generate cache pressure after that, and thus do not contribute cache pressure for this stage.
- the active operators in that lower stage will no longer contribute cache pressure for the upper stage.
- the same holds for the stages beside this stage in the query tree. Hence, only operators within the stage, the active operators, contribute cache pressure.
- the MaxResidency algorithm first identifies the stages within the plan tree for the query, computes an SPP for each stage, and then finally uses those SPPs to compute an SPP for each bee caller, which is then used to place bees to be invoked by that caller. There are several subcases to consider:
- Subcase 1 If a particular bee is invoked (either directly or indirectly) in a single stage, the SPP associated with that stage is then used to place that bee.
- Subcase 2 It may be the case that two instances of the same operator within a query plan each call a bee, but each bee would be specific to that operator, and so in this case there are two specific bees, each associated with one of those operator instances. Each of those bees would be placed using the associated per-stage SPP.
- Subcase 3 It may also be the case that a specific bee, specialized from particular invariant values, may be invoked (perhaps indirectly) by multiple operator nodes. In that case, this algorithm will conservatively treat the combination of the requisite SPPs of those stages as the resulting SPP: each slot would have the maximum of the associated slots of the constituent SPPs. The rationale is that each stage will execute separately, and so the bee would see only the pressure from one stage at any point in its execution.
- the bee invocation SPP should be added to the per-operator SPPs to render a per-stage reference I-cache MaxResidency SPP for each stage.
- Candidate Ordering Phase 1115 accepts a reference SPP from the previous phase and produces a candidate slot list by sorting the values in the SPP. (For MaxResidency, this is on a per-stage basis.) Algorithm 3 finds the next candidate slot from an SPP. Assuming that the bee takes consecutive blocks, Algorithm 3, as shown in Table 3, will choose the starting slot for the next bee, a slot that exhibits the least estimated cache pressure, according to the SPP.
- the inputs to this algorithm are the beeSize, which is the number of 64- byte blocks needed to accommodate the bee, and usedSlots, which contains slots that have been occupied by earlier bees; this set starts out empty.
- the algorithm calculates for each slot its estimated cache pressure: the maximum value of slotsTaken consecutive slots following it, if the slot hasn't been used yet. (Repeating slots will be dealt with soon.)
- An array extraImpact records the estimated cache pressure of each slot. The minimum value in extraImpact (of this additional bee) is identified and its index is returned as the candidate slot number. Just before returning the candidate slot, the occupied slot(s) are added to usedSlots to avoid conflict with future bee placement.
- some implementation details are omitted in Algorithm 3 as they are not tightly related to the idea and are straightforward. For example, a boundary check is needed near the end of the array SPP, when locating a large bee in one of the last slots.
- Algorithm 3 returns only the next candidate slot. To perform Candidate Ordering, this algorithm is successively invoked. The caller is responsible for resetting the usedSlots variable when it gets full.
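- since the listing of Algorithm 3 is not reproduced here, the following C sketch restates its core for illustration (variable names such as extraImpact, beeSize, and usedSlots follow the text; boundary handling is simplified to wrap-around): for every starting slot not yet used, the extra impact of the additional bee is the maximum SPP value over the beeSize consecutive slots it would occupy, and the slot with the least impact is chosen and reserved.

```c
#include <limits.h>
#include <stdbool.h>

#define NUM_SLOTS 128

int next_candidate_slot(const long spp[NUM_SLOTS], int beeSize,
                        bool usedSlots[NUM_SLOTS])
{
    long extraImpact[NUM_SLOTS];
    int  best = -1;

    for (int s = 0; s < NUM_SLOTS; s++) {
        if (usedSlots[s]) {
            extraImpact[s] = LONG_MAX;         /* already taken by a bee */
            continue;
        }
        extraImpact[s] = 0;
        for (int k = 0; k < beeSize; k++) {    /* pressure over the span */
            long v = spp[(s + k) % NUM_SLOTS];
            if (v > extraImpact[s])
                extraImpact[s] = v;
        }
        if (best < 0 || extraImpact[s] < extraImpact[best])
            best = s;
    }
    if (best >= 0)                             /* reserve the chosen span */
        for (int k = 0; k < beeSize; k++)
            usedSlots[(best + k) % NUM_SLOTS] = true;
    return best;
}
```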
- the MaxResidency Algorithm requires some extra effort with the usedSlots variable to ensure that the right number of bees is associated with each cache slot.
- consider an example in which the bee's size is 1 block, the cache's associativity is 4, and the MaxResidency SPP for that stage has 3 in slot A and 1 in slot B, which means slot A can accommodate one bee while slot B can accommodate three bees.
- the algorithm will first suggest using slot B and then slot A. At that point, only slot A needs to be removed from candidacy, and placement continues. Slot B would be removed after it had been indicated three times. Once this occurs and the slots have run out, the process simply returns to generating a uniform sequence of bee locations for the remaining bees.
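- The bookkeeping in this example can be sketched as follows (illustrative names; a simplified view, not the disclosed code): each slot may host associativity minus its MaxResidency value bees before it is withdrawn from candidacy, so with associativity 4 a slot with residency 3 is withdrawn after one placement, while a slot with residency 1 survives three.

```c
#include <stdbool.h>

#define NUM_SLOTS     128
#define ASSOCIATIVITY 4

static int remaining_capacity[NUM_SLOTS];

/* Each slot can host (associativity - residency) bees for this stage. */
void init_capacity(const int max_residency_spp[NUM_SLOTS])
{
    for (int s = 0; s < NUM_SLOTS; s++) {
        int cap = ASSOCIATIVITY - max_residency_spp[s];
        remaining_capacity[s] = cap > 0 ? cap : 0;
    }
}

/* Called each time candidate ordering suggests slot s; once the slot's
 * capacity is exhausted it is added to used_slots, withdrawing it. */
bool take_slot(int s, bool used_slots[NUM_SLOTS])
{
    if (remaining_capacity[s] == 0)
        return false;
    if (--remaining_capacity[s] == 0)
        used_slots[s] = true;
    return true;
}
```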
- the evaluation involves several independent and several dependent variables.
- the hardware configuration is one of the independent variables.
- An I-cache configuration of a 128-slot, 64B-block, 4-way associative L1 cache (separate instruction and data caches) and an 8192-slot, 64B-block, 16-way associative L3 cache (combined instruction and data) is adopted both for the cachegrind simulation and for empirical evaluations on real hardware.
- the physical machine has an additional intermediate-level cache: a 512-slot, 64B-block, 8-way associative combined cache.
- the physical machine has 7 GB of main memory.
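- For reference, assuming a "slot" here denotes a cache set, the capacities implied by these parameters follow directly from sets × block size × ways:

```latex
\begin{align*}
\text{L1 I-cache:}      \quad & 128  \times 64\,\mathrm{B} \times 4  = 32\,\mathrm{KiB} \\
\text{mid-level cache:} \quad & 512  \times 64\,\mathrm{B} \times 8  = 256\,\mathrm{KiB} \\
\text{L3 cache:}        \quad & 8192 \times 64\,\mathrm{B} \times 16 = 8\,\mathrm{MiB}
\end{align*}
```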
- the other independent variables are the training workload, the specific queries being evaluated, the size of bee code, the number of DBMS bee-invocation functions (50), the total number of invoked bees, and the placement algorithm used.
- the dependent variables are the miss rates at the L1 I-cache and L2 levels (that is, misses scaled by the number of references, and thus percentages), as those are the metrics that will be most impacted by bee placement, and the bee placement algorithm's running time.
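- In other words, the reported miss rate at a given cache level is simply:

```latex
\text{miss rate} \;=\; \frac{\text{misses at that cache level}}{\text{number of references}} \times 100\%
```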
- the bee code used is a constant sequence of 10 instructions (that is, there is only one bee that is to be replicated as needed) that does not actually speed up the DBMS.
- the point is that a realistic bee improves DBMS performance, but this improvement is reduced by cache misses resulting from bad placement.
- the actual bee code is not relevant for these purposes; all that matters is that it runs some code, thereby generating instruction fetches.
- Fig. 13 shows an exemplary flow diagram 1300 for the evaluation.
- the experiment protocol consists of three consecutive steps. Step 1305 is done statically, at DBMS compile time, and Steps 1310 and 1315 are done dynamically, during query planning.
- the first step 1305 determines which functions will invoke a bee; such functions are termed the bee callers.
- the DBMS, PostgreSQL 9.3.10, is first compiled with gcc (version 4.6, with the -O2 option). Since the goal of micro-specialization is to optimize often-used code, the top fifty most-used functions within query evaluation are then identified by executing the training workload sequentially (the TPC-H queries, omitting the five queries listed in the next section). For each so-identified function, a call is added in the source code to bee code (one or more calls, depending on the number of bees). A call graph profile is also obtained by running callgrind on that same workload. An operator-caller mapping is then generated by associating with each operator those functions that are called directly or indirectly by that operator. If two operators can reach the same function, the function will be associated with both operators.
- the second step 1310 and third step 1315 are performed when the stock and specialized versions of PostgreSQL are run, with the latter having the bee callers invoke the specified number of bees, generating relevant performance statistics.
- the TPC-H queries are also used, so that the requested number of bees will indeed be called.
- a constant number of bees (400) is placed for each query to generate a significant but representative amount of cache pressure for the algorithms to contend with. Note that more than 512 bees could overwhelm the physical instruction cache (as that is the number of slots).
- the second step 1310 occurs during query planning, for each query. It computes the number of bees needed to be called by each bee caller (an integer for each query), to ensure that the required total number of bees (an independent variable) is indeed invoked. Specifically, using the operator-caller mapping from the previous step, this step divides the total number of bees by the number of bee callers. (Functions that can be reached from multiple operators will be appropriately given multiple sets of bees.) When bees remain, that many bee callers are picked randomly to call one more bee. This implementation guarantees that each query will have exactly the same number of bees, though different sets of bees, determined by the operator(s) used.
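- As a concrete sketch of this apportionment (hypothetical names; the actual code lives inside the modified DBMS), the even division plus random assignment of the remainder might look like:

```c
#include <stdlib.h>

/* Divide total_bees evenly among n_callers bee callers; the remainder is
 * handed out, one extra bee each, to randomly chosen distinct callers. */
void assign_bees(int bees_per_caller[], int n_callers, int total_bees)
{
    int base = total_bees / n_callers;
    int remainder = total_bees % n_callers;     /* always < n_callers */

    for (int i = 0; i < n_callers; i++)
        bees_per_caller[i] = base;

    for (int given = 0; given < remainder; ) {
        int pick = rand() % n_callers;
        if (bees_per_caller[pick] == base) {    /* not yet given an extra bee */
            bees_per_caller[pick]++;
            given++;
        }
    }
}
```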
- the set of bees to be executed for each bee caller associated with that operator, computed in the previous step, is placed in each node of the query plan.
- the selected placement algorithm is then invoked to place those specific bees (some of the algorithms utilize the query in this placement).
- What is measured is the I-cache miss rate when bees are added to the stock query execution. Note that this is different from what will happen in practice, in two ways. First, each query will involve a different number and a different set of bees. Due to the complexity of the DBMS, and the desire to have some uniformity in the experiments, the simpler arrangement of 400 identical one-slot bees is set up to be invoked by each query. Secondly, in practice each query will replace stock code with a more efficient bee, whereas here the bees are added to the stock code, which actually puts a greater burden on the placement algorithms, as there is more cache pressure with the added bees. Given this experimental arrangement, it doesn't make sense to measure query time. Rather, the focus is on the I-cache miss rate for each query, as that query invokes 400 added bees at 50 invocation sites within the DBMS.
- the objective of the disclosed cache placement algorithms in this setting is to effect the minimum cache pressure, as indicated by the I-cache miss rate.
- a placement algorithm that does so should also perform well in the more realistic setting of a variable number of variable-sized bees, as all the algorithms attempt (with varying degrees of success) to place the added bees in valleys identified by the profiles available to that algorithm.
- each bee can fit in 64 bytes, or one slot.
- TPC-H at scale factor 1 is used as the representative workload, omitting queries 2, 17, 20, and 21 because they ran too slowly, and omitting query 15 because that query is a view. Because cache performance is what is being optimized, the scale factor is not relevant, as long as the data does not fit in the L2 cache, which is the case for scale 1.
- TPC-H is also used, scale factor 1, as the training workload. This is of course unrealistic. This is done because it presents the best-case scenario for these algorithms, of having the training workload be the same as the evaluation workload.
- the weighted and unweighted average I-cache miss rate across the TPC-H queries is first examined, provided in the last two rows of Table 4.
- the unweighted average is just the average of the cache miss rate over the queries (thus treating them equally), while the weighted average is weighted by each query's instruction count, so queries 1 and 18 contribute more than the rest to that average.
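- Formally, writing m_q for the I-cache miss rate of query q and I_q for its instruction count, over the n queries:

```latex
\bar{m}_{\mathrm{unweighted}} = \frac{1}{n}\sum_{q=1}^{n} m_q
\qquad\qquad
\bar{m}_{\mathrm{weighted}} = \frac{\sum_{q=1}^{n} I_q\, m_q}{\sum_{q=1}^{n} I_q}
```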
- the fastest algorithm for each query is indicated in boldface. Note that MaxResidency performs significantly better than all of the other algorithms in both cases.
- NextToCaller is inferior to SlotZero in terms of the weighted average metric, which is unexpected. This anomalous result occurs because SlotZero only affects the cache pressure on one slot. NextToCaller, however, spreads the cache pressure across no more than 50 cache slots (recall that this experiment introduces 50 bee callers into the DBMS); thus those slots can suffer from thrashing and attain even worse performance than SlotZero.
- MaxResidency excels for 12 queries.
- for queries 9, 14, and 16, the performance difference between MaxResidency and the best algorithm (for that query) was minimal.
- Query 1 happens to have only two phases, so MaxResidency's ability to handle complex query trees is not fully utilized. (Note that even though this query incurs a lot of instructions, MaxResidency's miss rate for this query is close to that of BinaryFunctionScore.)
- the MaxResidency threshold setting adopted (0.15) might play a role (see Table 5). That said, for most queries MaxResidency significantly bettered BinaryFunctionScore.
- the number of bees is also varied: 200 as a light workload; 400 as a medium workload; and 500 and 1000 as heavy workloads. In all cases, MaxResidency produced the best results, for both the weighted and unweighted averages. To indicate how well MaxResidency works in these other cases, the weighted cache miss rate averages are: 0.213% for 200 bees; 0.275% for 500 bees; and 0.498% for 1000 bees.
- Figure 14 shows how the MaxResidency cache miss rate changes as the number of bees is varied.
- Query 4 is marked with a square and the weighted average with a diamond (query 4 is shown because it exhibits the most significant change as the number of bees is varied).
- At 1000 bees, the cache miss rates of query 4 and of the weighted average diverge, demonstrating that for query 4 the cache miss rate goes up sharply after the I-cache slots have been filled up. The transition appears to be between 400 and 500 bees. This makes sense, because the I-cache has 512 slots and the DBMS stock code of course (partially) occupies many of them.
- NextToCaller achieved the minimum I-cache miss rate for query 6. However, its execution time was almost twice as long as that of the others. The reason for such a significant slow-down is that NextToCaller's L2-cache miss rate is significantly higher than that of the other algorithms. As L2-cache misses are much more expensive than I-cache misses, this effect dominates the run time. L2-cache miss rates are not shown in the table discussed above (with fifty bee callers) because those misses are spread across a large number of L2-cache slots, resulting in a uniformly low L2-cache miss rate, less than 0.012% in all cases.
- Huang et al. [X. Huang, S. M. Blackburn, D. Grove, and K. S. McKinley. Fast and efficient partial code reordering: Taking advantage of dynamic recompilation. In Proc. International Symposium on Memory Management, pages 184-192, New York, NY, 2006] proposed Code Reordering, focusing on direct-mapped caches, with an associativity of 1. Sherwood et al. [T. Sherwood, B. Calder, and J. Emer. Reducing cache misses using hardware and software page placement. In Proc.
- the MaxResidency dynamic bee placement algorithm minimizes I-cache pressure for workloads with as many as 1000 bees.
- aggressive micro-specialization of a large number of functions invoked during query evaluation can improve query evaluation performance without concern for increased I-cache pressure (and thus also L2-cache pressure), through the use of this highly effective bee placement algorithm.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
Claims
Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US14/968,296 US10365900B2 (en) | 2011-12-23 | 2015-12-14 | Broadening field specialization |
| PCT/US2016/066665 WO2017106351A1 (en) | 2015-12-14 | 2016-12-14 | Broadening field specialization |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| EP3391192A1 true EP3391192A1 (en) | 2018-10-24 |
| EP3391192A4 EP3391192A4 (en) | 2019-11-06 |
Family
ID=59057797
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| EP16876594.9A Withdrawn EP3391192A4 (en) | 2015-12-14 | 2016-12-14 | ENLARGEMENT OF FIELD SPECIALIZATION |
Country Status (4)
| Country | Link |
|---|---|
| EP (1) | EP3391192A4 (en) |
| CN (1) | CN108369500A (en) |
| CA (1) | CA3007414A1 (en) |
| WO (1) | WO2017106351A1 (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112035099B (en) * | 2020-09-01 | 2024-03-15 | 北京天融信网络安全技术有限公司 | Vectorization representation method and device for nodes in abstract syntax tree |
Family Cites Families (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5999924A (en) * | 1997-07-25 | 1999-12-07 | Amazon.Com, Inc. | Method and apparatus for producing sequenced queries |
| WO2002071260A1 (en) * | 2001-03-01 | 2002-09-12 | Aalborg Universitet | Adaptable query optimization and evaluation in temporal middleware |
| US7254810B2 (en) * | 2002-04-18 | 2007-08-07 | International Business Machines Corporation | Apparatus and method for using database knowledge to optimize a computer program |
| US7805456B2 (en) * | 2007-02-05 | 2010-09-28 | Microsoft Corporation | Query pattern to enable type flow of element types |
| US7937243B2 (en) * | 2007-08-03 | 2011-05-03 | Ailive, Inc. | Method and apparatus for non-disruptive embedding of specialized elements |
| US7895171B2 (en) * | 2008-03-27 | 2011-02-22 | International Business Machines Corporation | Compressibility estimation of non-unique indexes in a database management system |
| US9135405B2 (en) * | 2011-05-26 | 2015-09-15 | Carnegie Mellon University | Automated exploit generation |
| US8793240B2 (en) * | 2011-08-26 | 2014-07-29 | Oracle International Corporation | Generation of machine code for a database statement by specialization of interpreter code |
| KR101615907B1 (en) * | 2011-12-16 | 2016-04-27 | 인텔 코포레이션 | Method and system using exceptions for code specialization in a computer architecture that supports transactions |
| EP2795484A4 (en) * | 2011-12-23 | 2015-11-11 | Univ Arizona State | METHODS OF MICRO-SPECIALIZATION IN DATABASE MANAGEMENT SYSTEMS |
| CN102981851B (en) * | 2012-11-15 | 2016-02-03 | 深圳市共进电子股份有限公司 | A kind of fast Development maintenance system of Embedded Network Device interface languages and method |
| CN104951489B (en) * | 2014-09-04 | 2017-09-15 | 国网山东省电力公司应急管理中心 | A kind of meteorological data analyzing and processing method applied to power system |
-
2016
- 2016-12-14 CN CN201680073554.1A patent/CN108369500A/en active Pending
- 2016-12-14 EP EP16876594.9A patent/EP3391192A4/en not_active Withdrawn
- 2016-12-14 CA CA3007414A patent/CA3007414A1/en not_active Abandoned
- 2016-12-14 WO PCT/US2016/066665 patent/WO2017106351A1/en not_active Ceased
Also Published As
| Publication number | Publication date |
|---|---|
| WO2017106351A1 (en) | 2017-06-22 |
| EP3391192A4 (en) | 2019-11-06 |
| CA3007414A1 (en) | 2017-06-22 |
| CN108369500A (en) | 2018-08-03 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Talati et al. | Prodigy: Improving the memory latency of data-indirect irregular workloads using hardware-software co-design | |
| Lv et al. | A survey on static cache analysis for real-time systems | |
| US8132162B2 (en) | Runtime machine analysis of applications to select methods suitable for method level caching | |
| US8539455B2 (en) | System for and method of capturing performance characteristics data from a computer system and modeling target system performance | |
| Kanev et al. | Mallacc: Accelerating memory allocation | |
| Annavaram et al. | Call graph prefetching for database applications | |
| Lee et al. | MCC-DB: Minimizing cache conflicts in multi-core processors for databases | |
| US10733099B2 (en) | Broadening field specialization | |
| Song et al. | Thermometer: profile-guided btb replacement for data center applications | |
| Kiani et al. | Efficient cache performance modeling in GPUs using reuse distance analysis | |
| Xie et al. | CRAT: Enabling coordinated register allocation and thread-level parallelism optimization for GPUs | |
| Huber et al. | Worst‐case execution time analysis‐driven object cache design | |
| Abdi et al. | A community cache with complete information | |
| Nicholson et al. | Hpcache: memory-efficient OLAP through proportional caching | |
| Luo et al. | Harvesting memory-bound {CPU} stall cycles in software with {MSH} | |
| Memarzia et al. | The art of efficient in-memory query processing on NUMA systems: a systematic approach | |
| Popov et al. | Piecewise holistic autotuning of parallel programs with cere | |
| WO2017106351A1 (en) | Broadening field specialization | |
| Nicholson et al. | HPCache: memory-efficient OLAP through proportional caching revisited | |
| US10747515B2 (en) | Fields hotness based object splitting | |
| Hua et al. | Data similarity-aware computation infrastructure for the cloud | |
| Moura et al. | Learning to rank graph-based application objects on heterogeneous memories | |
| Chakraborty et al. | Integrating software caches with scratch pad memory | |
| Memarzia et al. | Toward Efficient In-memory Data Analytics on NUMA Systems | |
| He | Towards a Transparent and Efficient Far Memory System |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PUAI | Public reference made under article 153(3) epc to a published international application that has entered the european phase |
Free format text: ORIGINAL CODE: 0009012 |
|
| 17P | Request for examination filed |
Effective date: 20180608 |
|
| AK | Designated contracting states |
Kind code of ref document: A1 Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR |
|
| AX | Request for extension of the european patent |
Extension state: BA ME |
|
| DAV | Request for validation of the european patent (deleted) | ||
| DAX | Request for extension of the european patent (deleted) | ||
| RIC1 | Information provided on ipc code assigned before grant |
Ipc: G06F 7/00 20060101ALI20190621BHEP Ipc: G06F 8/30 20180101ALI20190621BHEP Ipc: G06F 9/445 20180101ALI20190621BHEP Ipc: G06F 8/41 20180101AFI20190621BHEP Ipc: G06F 16/2453 20190101ALI20190621BHEP |
|
| A4 | Supplementary search report drawn up and despatched |
Effective date: 20191007 |
|
| RIC1 | Information provided on ipc code assigned before grant |
Ipc: G06F 7/00 20060101ALI20190930BHEP Ipc: G06F 9/445 20180101ALI20190930BHEP Ipc: G06F 16/2453 20190101ALI20190930BHEP Ipc: G06F 8/41 20180101AFI20190930BHEP Ipc: G06F 8/30 20180101ALI20190930BHEP |
|
| STAA | Information on the status of an ep patent application or granted ep patent |
Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
|
| 18D | Application deemed to be withdrawn |
Effective date: 20200603 |