Techniques of Dynamic Run-time Query Optimization
Optimization is a critical component in the processing of each query submitted to a database management system (DBMS). The difference between a properly optimized query and an improperly optimized one is enormous: a query following a suboptimal plan can take exponentially more time than the same query executed under an optimal plan. To determine the most efficient way to execute a given query, the DBMS must evaluate candidate execution strategies and choose the most favorable among those it has calculated.
In recent years, database researchers have begun to rethink the mechanisms for such query optimization. In part, this is due to the expansion of the internet and the distributed queries this has made possible, but the techniques uncovered can also be applied to non-distributed environments. The research focuses on the consideration of processing factors at the time that queries are executed, and the adaptation of their execution in response to real-time data sizes and transfer rates.
This paper will outline the growing research field of adaptive query processing. I will first briefly explain the mechanisms of traditional query processing common today. I will then describe the problems with traditional query processing that adaptive query processing seeks to overcome. Lastly, I will outline various experimental solutions that show great promise.
Traditional query processing is dominated by statistics collection and probability theory. The DBMS periodically gathers statistics about its relations and the tuples within them. When a query is submitted, the optimizer uses this collected information to estimate an optimal query plan. The computed execution costs of the candidate plans are then compared; the DBMS picks the cheapest one and executes it.
In most commercial DBMSs, the optimal execution plan for a given query is stored in memory and reused each time the same query is executed. By determining the query plan at compile time and then caching the results, the DBMS ensures that it is not incurring the overhead of recalculating a query plan for the same query each time it is executed. The DBMS will only recompile the query plan when there are relation schema changes, a significant change in the collected statistics, or when the cache is full and a previously compiled plan has been aged out of the cache.
The collected statistics include relation statistics, such as the number of records, the number of blocks, and the number of distinct values in the relation. For each attribute in the relation, statistics are kept on the attribute's selectivity, which is the average number of records that will satisfy an equality selection condition on that attribute. These statistics are estimated over a small subset of the entire relation, because collecting statistics on the entire relation could be extremely time consuming.
Using these statistics, the probabilities of relation characteristics can be determined. For example, if a given attribute is a relation’s only primary key, the DBMS knows that each tuple selected using that attribute will be unique. In other situations, the number of distinct values in the analyzed subset is extrapolated for the entire relation. This provides the likelihood that the specified attribute produces a unique tuple.
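The extrapolation step described above can be sketched in a few lines of Python. This is a hypothetical illustration with invented names, not any DBMS's actual code; it shows the classic uniform-distribution estimate, where the selectivity of an equality predicate is taken to be one over the number of distinct values observed in a sample:

```python
# Hypothetical sketch of selectivity estimation from a sampled subset.

def estimate_selectivity(sample, attribute):
    """Fraction of tuples expected to satisfy attribute = constant,
    assuming values are uniformly distributed."""
    distinct = len({t[attribute] for t in sample})
    return 1.0 / distinct if distinct else 1.0

def estimate_cardinality(relation_size, sample, attribute):
    """Expected number of matching tuples in the full relation."""
    return relation_size * estimate_selectivity(sample, attribute)

# A sampled subset of a hypothetical 10,000-record EMPLOYEE relation:
# 20 sampled tuples containing 5 distinct department numbers.
sample = [{"dno": d} for d in (1, 2, 3, 4, 5) for _ in range(4)]
print(estimate_cardinality(10_000, sample, "dno"))  # -> 2000.0
```

Real optimizers refine this with histograms and other summaries, but the principle of measuring a sample and extrapolating to the whole relation is the same.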
Analyzing and estimating query plans in the probabilistic manner described above has several inherent flaws. At compile time, the statistics necessary to compute an optimal query plan may be unavailable or of poor quality. Additionally, the performance of any given plan may vary as available memory fluctuates with the server's load, so a plan that is optimal at one moment may be suboptimal at another. In distributed queries, network delays, node inefficiencies, and the potentially heterogeneous nature of the data further undermine precomputed execution plans.
In some cases, it is not possible at compile time to gather accurate statistics about the data sources; in others, previously gathered statistics are outdated and no longer reflect actual data patterns. Adaptive query processing therefore gathers, at run time, the statistics that were unavailable or insufficient at compile time, ensuring their accuracy.
Memory fluctuations refer to the need to adapt to memory shortages and to the availability of excess memory. Memory shortages can happen due to competition from higher-priority transactions, for example. In that case, running query plans may be forced to release some or all of the resources they hold.
Network delays are caused by the differing I/O rates of different nodes at different times, depending on network traffic. Adaptive queries target three types of network delay: initial delay (a long wait for the first tuple), slow delivery (a constant, slow rate of tuples), and bursty arrival (fluctuation between many tuples and none). Generally, the response times of remote data sources are unpredictable. A node in the network may be idle or unburdened at compile time but busy at run time, and the network pipe may be free at compile time but saturated at run time. These inconsistencies make reliable optimal execution plans difficult to determine.
It would also be advantageous to display partial, incremental results whenever possible, so that the user issuing the query can see at least some results as soon as possible.
In distributed queries, the cost of network data transfer dominates the cost of the execution plan. Consider the following algebraic query [Elmasri, Navathe, 2003]:
Π fname, lname, dname (EMPLOYEE ⋈ dno=dnumber DEPARTMENT)
Also consider the following factors:
With these parameters, three different strategies to perform the join can be analyzed:
Clearly, the third option requires the least amount of data to be transferred and will be at least 2x faster than the next-best option. However, as data transfer rates between nodes change with network traffic, the relative optimality of these options changes as well. It is therefore next to impossible to compute a single optimal query plan at compile time.
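The comparison can be made concrete with a short computation. The sizes below are assumed purely for illustration, in the spirit of the textbook example: 10,000 EMPLOYEE records of 100 bytes, 100 DEPARTMENT records of 35 bytes, and 10,000 join-result records of 40 bytes, with the result needed at a third site:

```python
# Compare the bytes transferred by three distributed-join strategies.
# All sizes are illustrative assumptions, not measurements.

EMP_BYTES = 10_000 * 100     # EMPLOYEE relation
DEPT_BYTES = 100 * 35        # DEPARTMENT relation
RESULT_BYTES = 10_000 * 40   # projected join result

strategies = {
    # 1. Ship both relations to the result site and join there.
    "ship both to result site": EMP_BYTES + DEPT_BYTES,
    # 2. Ship EMPLOYEE to DEPARTMENT's site, join, ship the result.
    "ship EMPLOYEE to DEPARTMENT": EMP_BYTES + RESULT_BYTES,
    # 3. Ship DEPARTMENT to EMPLOYEE's site, join, ship the result.
    "ship DEPARTMENT to EMPLOYEE": DEPT_BYTES + RESULT_BYTES,
}
for name, cost in sorted(strategies.items(), key=lambda kv: kv[1]):
    print(f"{name}: {cost:,} bytes")
```

Under these assumptions the third strategy transfers 403,500 bytes versus 1,003,500 for the next best, but a congested link to the EMPLOYEE site could reverse the ranking at run time.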
Enter adaptive query processing: a set of techniques devised to combat the inherent flaws of traditional query processing, aimed at producing optimal query plans in situations where traditional plans fail. These techniques come in two flavors: proactive operators, which are coded into an execution plan and autonomously adapt to run-time changes, and reactive optimizers, which re-form the plan and reorder pre-existing operators in response to run-time conditions. For each of these types, I will examine two implementations.
XJoin [Urhan, Franklin, 2000] is a pipelined join operator designed to keep producing results even when tuples arrive slowly or in bursts. To accomplish this, XJoin splits the hash tables into memory-resident partitions and disk-resident partitions. During periods of network congestion, tuples will be slow to arrive at the site executing the join; XJoin moves these arriving tuples directly to the hash tables in memory and applies the join operator. Conversely, during periods of accelerated, bursty activity, the data flow rate will exceed the speed at which tuples can be matched in memory; XJoin stores these tuples to the disk-resident partitions at the rate they arrive. Later, when periods of inactivity resume, joins involving the tuples stored on disk can be applied.
In this way, XJoin is always doing something, regardless of the speed at which tuples are arriving: there are memory-to-memory matches, disk-to-memory matches, and disk-to-disk matches. To ensure that all matching tuples are produced, XJoin includes a clean-up phase that runs after all tuples have arrived. This ensures that matches between tuples that were never present in the same partition at the same time are still produced.
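The partitioning scheme can be sketched as a toy. This is entirely hypothetical illustration code, not the authors' implementation: each input gets a memory-resident hash table that spills to a "disk" list when full, and a clean-up pass recovers matches involving spilled tuples. Real XJoin also uses timestamps to suppress duplicate results, which this sketch omits:

```python
from collections import defaultdict

class TinyXJoin:
    """Toy two-input hash join with memory- and disk-resident partitions."""

    def __init__(self, key, memory_limit=2):
        self.key = key
        self.memory_limit = memory_limit            # tuples per side in memory
        self.mem = {0: defaultdict(list), 1: defaultdict(list)}
        self.disk = {0: [], 1: []}                  # spilled tuples

    def insert(self, side, tup):
        """On arrival: probe the other side's memory partition, then store
        the tuple, spilling to disk if this side's memory is full."""
        k = tup[self.key]
        matches = [(tup, other) for other in self.mem[1 - side][k]]
        if sum(len(b) for b in self.mem[side].values()) < self.memory_limit:
            self.mem[side][k].append(tup)
        else:
            self.disk[side].append(tup)             # joined later, in clean-up
        return matches

    def cleanup(self):
        """After all input has arrived, match spilled tuples so nothing is
        lost (may re-emit some pairs; real XJoin deduplicates these)."""
        mem0 = [t for b in self.mem[0].values() for t in b]
        mem1 = [t for b in self.mem[1].values() for t in b]
        pairs = [(t, o) for t in self.disk[0]
                 for o in self.disk[1] + mem1 if o[self.key] == t[self.key]]
        pairs += [(o, t) for t in self.disk[1]
                  for o in mem0 if o[self.key] == t[self.key]]
        return pairs

j = TinyXJoin(key="k")
j.insert(0, {"k": 1, "v": "a"})
j.insert(1, {"k": 1, "v": "x"})     # matches "a" immediately, in memory
j.insert(0, {"k": 2, "v": "b"})
j.insert(0, {"k": 1, "v": "c"})     # memory full: "c" spills to disk
j.insert(1, {"k": 1, "v": "y"})     # probes memory only, so it misses "c"
print(j.cleanup())                  # recovers ("c", "y"); re-emits ("c", "x")
```

The key property is the one the paper stresses: no matter which partition a tuple landed in, every match is eventually produced.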
Eddies [Avnur, Hellerstein, 2000] are a more conceptual implementation of an adaptive query processing operator. The basic idea is that an entity known as an eddy is added to a query's execution plan to control the query's operators. As can be seen from the figure below, data from these operators (running as independent threads) flows through the eddy. Because the eddy functions as a central hub between the operators, it can adaptively choose the best order in which to route tuples to each successive operator.
The eddy functions by maintaining a priority queue of all tuples needing processing. As tuples move from one operator to the next, their priority level increases; this ensures that tuples at later stages of processing are processed first. Additionally, a tuple's priority is adjusted based on the consumption and production rates of the operators that still need to process it. Because a low-cost operator can consume tuples more quickly than a high-cost one, it represents a smaller share of the total processing time required and can be given more tuples earlier. The eddy "learns" the operators' relative performance by tracking the rates at which tuples are routed through them.
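The routing idea can be illustrated with a toy eddy over selection operators. This sketch is my own, loosely inspired by the lottery-scheduling scheme in the Eddies paper but greatly simplified: each operator holds "tickets", operators that eliminate tuples earn more, and routing choices are weighted by ticket count, so tuples tend to visit selective operators first:

```python
import random

class ToyEddy:
    """Route each tuple through a set of selection operators in an
    adaptively chosen order (simplified lottery scheduling)."""

    def __init__(self, operators):
        self.ops = operators                      # name -> predicate
        self.tickets = {name: 1 for name in operators}

    def route(self, tup, rng=random):
        pending = set(self.ops)
        while pending:
            names = sorted(pending)
            name = rng.choices(names, [self.tickets[n] for n in names])[0]
            if self.ops[name](tup):
                pending.discard(name)             # passed; operator done
            else:
                self.tickets[name] += 1           # reward selective operator
                return None                       # tuple eliminated early
        return tup                                # survived every operator

eddy = ToyEddy({"even": lambda t: t % 2 == 0, "small": lambda t: t < 10})
survivors = [t for t in range(20) if eddy.route(t) is not None]
print(survivors)        # -> [0, 2, 4, 6, 8], whatever routing order is chosen
```

Note that the final answer is independent of the routing order; what the eddy adapts is only the cost of reaching it.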
The technique of Query Scrambling [Urhan, Franklin, Amsaleg, 1998] reactively reschedules a query's execution when remote data sources fail to deliver data on time.
Query Scrambling works by associating a timer with each operator that accesses data from a remote site. The timer is started when the operator begins waiting for data and is stopped when the data arrives. If the timer goes off before data arrives from the remote site, a new thread is launched to process a different subsection of the query. When tuples from the delaying source finally arrive, that subsection of the query is resumed and processing can continue. A node in the execution plan, therefore, is in one of several states described by the diagram below:
It is important to note that such reactive processing may increase the overall cost of the query in terms of memory, I/O, etc., trading these resources for faster response time.
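The timer-driven switch can be sketched with threads. This is hypothetical illustration code only, not the authors' system: each remote fetch runs in its own thread, and when the coordinator's wait on a result times out, it does other work instead of blocking:

```python
import queue
import threading
import time

def scramble(fetchers, timeout=0.05):
    """fetchers: name -> zero-argument function simulating a remote fetch.
    Returns the order in which events occurred at the coordinator."""
    results_q = queue.Queue()
    for name, fetch in fetchers.items():
        threading.Thread(target=lambda n=name, f=fetch: results_q.put((n, f())),
                         daemon=True).start()
    order, pending = [], set(fetchers)
    while pending:
        try:
            name, _rows = results_q.get(timeout=timeout)  # timer on the wait
            order.append(name)
            pending.discard(name)
        except queue.Empty:
            # Timer went off: a real scrambler would now execute a different
            # subtree of the query; here we just record the switch.
            order.append("<switched to other work>")
    return order

print(scramble({"fast": lambda: ["r1"],
                "slow": lambda: time.sleep(0.3) or ["r2"]}))
# "fast" arrives first; several switches occur while "slow" is delayed.
```

The essential point matches the diagram's states: an operator waiting on a slow source is suspended rather than allowed to stall the whole plan.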
The set of optimizations known as Mid-query Re-optimization [Kabra, DeWitt, 1998] describes the collection of additional statistics at key points during query execution. This technique counters the fact that statistics collected by the DBMS might be outdated at the time of query execution (i.e., many tuples might have been added to a relation, rendering the collected statistics invalid). The statistics collected by Mid-query Re-optimization can be very specific, because the exact need and use is known at run time by the currently executing query.
Based on these statistics collected as the query runs, the optimizer knows that it has accurate figures and can improve resource allocation or change the execution plan for the remainder of the query if suboptimality is detected. It is of course vitally important to carefully determine when and how to collect the necessary statistics, and Mid-query Re-optimization includes many heuristics for this. For example, a statistics collection routine can compute average tuple size without interrupting the normal execution of a query by examining the tuples produced by a SELECT operator and keeping a running count of the matching tuples. This is done during the same pass over the input data that produces the output, ensuring that run-time statistics collection does not slow down query execution.
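The piggy-backed collection can be sketched as an instrumented operator. This is hypothetical illustration code, not Kabra and DeWitt's implementation: tuples stream through a SELECT unchanged while running counts are maintained as a side effect of the single pass:

```python
class InstrumentedSelect:
    """A SELECT operator that measures selectivity and output size
    as a side effect of its normal, single pass over the input."""

    def __init__(self, source, predicate):
        self.source, self.predicate = source, predicate
        self.seen = self.matched = self.out_bytes = 0

    def __iter__(self):
        for tup in self.source:
            self.seen += 1
            if self.predicate(tup):
                self.matched += 1
                self.out_bytes += len(str(tup))   # crude tuple-size proxy
                yield tup

    def observed_selectivity(self):
        return self.matched / self.seen if self.seen else 0.0

rows = [{"a": i} for i in range(100)]
op = InstrumentedSelect(rows, lambda t: t["a"] % 4 == 0)
result = list(op)                               # normal query execution
print(len(result), op.observed_selectivity())   # -> 25 0.25
```

If the observed selectivity diverges far enough from the compile-time estimate, the optimizer can re-plan the not-yet-executed portion of the query.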
Kabra and DeWitt’s experimental results have been very promising. Although queries with simple joins (0-1 relations) show no improvement in execution time, queries with joins of medium complexity (2-3 relations) show mild improvement and queries with complex joins (4+ relations) show significant improvement.
The techniques described in this paper represent a major step forward in the evolution of query processing. Adaptive query processing can be very powerful in the distributed systems where its techniques are required, and it also shows promising results in non-distributed systems. As with other heuristic techniques, there is a trade-off between the performance gains and the cost of the additional processing, statistics maintenance, query recompilation, and the like. As time goes by, research will no doubt uncover additional methods that further increase the efficiency of query processing, yielding faster query execution and more optimized execution plans.
(1) A. Gounaris, N. W. Paton, A. Fernandes, and R. Sakellariou. Adaptive Query Processing: A Survey. In 19th British National Conference on Databases, pages 11-25, 2002.
This paper presents an overview of the history and recent developments of adaptive query processing. It includes a brief description of several techniques, the power they provide, and their adaptive abilities.
(2) J. M. Hellerstein, M. J. Franklin, S. Chandrasekaran, A. Deshpande, K. Hildrum, S. Madden, V. Raman, and M. Shah. Adaptive query processing: Technology in evolution. IEEE Data Engineering Bulletin, 23(2):7-18, 2000.
This paper describes the Telegraph project developed at the University of California, Berkeley.
(3) Z. Ives, A. Levy, D. Weld, D. Florescu, and M. Friedman. Adaptive query processing for internet applications. IEEE Data Engineering Bulletin, 23(2):19-26, 2000.
This paper describes adaptive query optimization as it relates to the benefits and fallacies of a wide-area system using the Internet. Specifically, it analyzes the need to adjust processing based on I/O delays and data flow rates, sharing and re-use of data across multiple queries, and the ability to output partial, incremental results.
(4) G. Graefe. Dynamic Query Evaluation Plans: Some Course Corrections? IEEE Data Engineering Bulletin, 23(2):4-7, 2000.
This paper seeks to realign the focus of adaptive query processing to include query execution and physical database design. The author believes that tying the three together will yield a broader understanding of the issues involved in adaptive optimization.
(5) R. Avnur and M. Thomas. Continuously Re-optimizing Query Processor. Technical Report.
This paper describes a query processor that does not rely on statistics gathering, but instead adapts to the performance of the query operators. The study focused on select operators and pipelined index-joins and hash ripple-joins.
(6) R. Avnur and J. Hellerstein. Eddies: Continuously adaptive query processing. Proceedings of the ACM SIGMOD, 2000, pages 261-272, 2000.
The authors present a learning algorithm that controls adaptations to run-time changes, including cost and selectivity of operators, in the execution environment.
(7) N. Kabra and D. J. DeWitt. Efficient Mid-Query Re-Optimization of Sub-Optimal Query Execution Plans. Proceedings of the ACM SIGMOD International Conference on Management of Data, Vol. 27, pages 106-117, ACM Press, NY, 1998.
The authors describe a system that detects sub-optimal query plans and attempts to correct them during execution. In this way, statistics collected at key points during execution are used to optimize; careful consideration of when and how to collect these statistics is necessary.
(8) K. Ng, Z. Wang, R. Muntz, and S. Nittel. Dynamic query re-optimization. Proceedings of 11th International Conference on Statistical and Scientific Database Management, pages 264-273, IEEE Computer Society, 1999.
This paper analyzes the situation in which system configuration and resource availability change during the execution of long running queries. Such queries are therefore executed using sub-optimal plans; this paper describes an approach to re-optimize while considering reconfiguration cost as well as execution cost.
(9) K. Ng, Z. Wang, and R. Muntz. Dynamic reconfiguration of sub-optimal parallel query execution plans. Technical Report, UCLA, 1998.
The authors propose an algorithm to coordinate the steps in a dynamic reconfiguration of query execution plans. Additionally, the authors suggest a syntactic extension to SQL to allow users to specify if and how query plans should be reconfigured for each query.
(10) L. Bouganim, F. Fabret, C. Mohan, and P. Valduriez. Dynamic query scheduling in data integration systems. Proceedings of ICDE 2000, pages 425-434, 2000.
This paper focuses on the problem of an unpredictable data rate in coordination with query scheduling. The approach performs a step-by-step scheduling of several query fragments, which are then processed based on data arrival. This yields significant performance gains.
(11) T. Urhan and M. Franklin. XJoin: A reactively-scheduled pipelined join operator. IEEE Data Engineering Bulletin, pages 27-33, 2000.
This paper describes the design of XJoin, an operator capable of providing responsive query processing when accessing data from widely distributed sources. XJoin functions in a pipelined fashion, so that several concurrent tasks are each executing at different stages as appropriate.
(12) S. R. Madden, M. A. Shah, J. M. Hellerstein, and V. Raman. Continuously Adaptive Continuous Queries over Streams. Proceedings of the ACM SIGMOD International Conference on Management of Data, 2002.
The authors describe adaptive solutions to continuous queries – queries that run over an infinite, always changing data source. Because of their nature, such unbounded queries will run long enough to experience changes in system and data properties and adaptive mechanisms are necessary for optimal query execution.
(13) M. Shah, J. Hellerstein, S. Chandrasekaran, and M. Franklin. Flux: An Adaptive Partitioning Operator for Continuous Query Systems. Technical Report.
This paper introduces a dataflow operator called Flux (Fault-tolerant Load-balancing eXchange), which repartitions lookup-based operators while the pipeline is still executing. The mechanism provides great improvements in throughput and average latency for continuous queries.
(14) Z. Ives, D. Florescu, M. Friedman, A. Levy, and D. S. Weld. An Adaptive Query Execution System for Data Integration. Proceedings of the ACM SIGMOD International Conference on Management of Data, 1999.
This paper presents the Tukwila data integration system, which uses adaptive-execution features to improve performance. The system was designed to overcome three inherent problems with traditional database query execution techniques when executing queries on remote sources: the absence of quality statistics, unpredictable and bursty data arrival rates, and frequent redundancy of the data sources.
(15) V. Zadorozhny, M.E. Vidal, L. Raschid, T. Urhan, and L. Bright. Efficient evaluation of queries in a mediator for websources. Under review, 1999.
This paper describes a two-phase Web Query Optimizer that efficiently searches a large space of plans at run-time to obtain a low cost plan (in comparison to a traditional optimizer).
(16) T. Urhan and M. J. Franklin. XJoin: Getting fast answers from slow and bursty networks. Technical Report, University of Maryland, 1999.
This paper is the initial technical report on XJoin, and it describes much of the background in the authors’ search for techniques to allow querying through heterogeneous and semi-structured databases from sources distributed around the world.
(17) J. Forrester and J. Ledlie. XJoin and the Benefits of Free Work. Technical Report.
The authors report on their experience implementing the XJoin algorithm. Their implementation performed well, and the paper includes the source code they created.
(18) A. Y. Levy. Combining artificial intelligence and databases for data integration. Technical Report.
This paper describes several adaptive AI techniques to solve problems of integrating data from a multitude of independent, autonomous data sources.
(19) L. Amsaleg, M. J. Franklin, and A. Tomasic. Dynamic Query Operator Scheduling for Wide-Area Remote Access. Journal of Distributed and Parallel Databases, Vol. 6, No. 3, 1998.
The authors describe a technique they devised called Query Scrambling, which dynamically modifies query execution plans on-the-fly in reaction to unexpected delays in data access (when accessing remote sources). They focus on the dynamic scheduling of query operators and simulate the effects of the scheduler.
(20) H. Paques, L. Liu, and C. Pu. Distributed Query Adaptation and Its Trade-offs. Technical Report, Georgia Institute of Technology, 2002.
The authors describe an overview of Ginga, an adaptive query processing engine that combines proactive (compile-time) alternative query plan generation with reactive (run-time) monitoring of network delays.
(21) H. Paques, L. Liu, and C. Pu. Ginga: A Self-Adaptive Query Processing System. Technical Report, Georgia Institute of Technology, 2003.
The authors describe Ginga in greater detail, including its architecture and the various phases the system encompasses throughout its life cycle.
(22) C. Shahabi, L. Khan, D. Mcleod and V. Shah. Run-Time Optimization of Join Queries for Distributed Databases over the Internet. Proceedings of Communication Networks and Distributed Systems Modeling and Simulation, 1999.
The authors devise a run-time, probing mechanism for use in JOIN queries that produces optimal query execution plans in distributed systems.
(23) S. Chaudhuri. An Overview of Query Optimization in Relational Systems. Proceedings of the ACM PODS, pages 34-43, 1998.
The general issues in query optimization for relational systems are surveyed.
(24) L. Amsaleg, M. Franklin, and A. Tomasic. Query Scrambling for Bursty Data Arrival. Technical Report.
The Query Scrambling technique is described in detail.
(25) R. Elmasri and S. Navathe. Fundamentals of Database Systems, Fourth Edition, 2003.
Our course textbook, which provides an overview of all functions within a DBMS.