Adaptive Query Processing: Techniques of Dynamic Run-time Optimization
Russell Greenspan
December 2003
Optimization is a critical component in the processing of
each query submitted to a database management system (DBMS). The difference
between a properly optimized query and an improperly optimized one is huge, as
a query following a suboptimal plan can take exponentially more time than the
same query executed following an optimal plan. To determine the most efficient
manner in which to execute a given query, the DBMS must evaluate candidate
execution strategies and choose the most favorable of those it calculates.
In recent years, database researchers have begun to
rethink the mechanisms for such query optimization. In part, this is due to the
expansion of the internet and the distributed queries this has made possible,
but the techniques uncovered can also be applied to non-distributed
environments. The research focuses on the consideration of processing factors
at the time that queries are executed, and the adaptation of their execution in
response to real-time data sizes and transfer rates.
This paper will outline the growing research field
of adaptive query processing. I will begin by briefly explaining the mechanisms
of traditional query processing as commonly practiced today. I will then describe
the problems with traditional query processing that adaptive query processing
seeks to overcome. Lastly, I will outline various experimental solutions that
show great promise.
Traditional query processing is dominated by
statistics collection and probability theory. The DBMS gathers statistics about
its relations, and the tuples in those relations, at various intervals. When a
query is submitted, it estimates an optimal query plan using this collected
information. The estimated execution costs of the candidate plans are then
compared; the DBMS picks the best one and executes it.
In most commercial DBMSs, the optimal execution
plan for a given query is stored in memory and reused each time the same query
is executed. By determining the query plan at compile time and then caching the
results, the DBMS ensures that it is not incurring the overhead of
recalculating a query plan for the same query each time it is executed. The
DBMS will only recompile the query plan when there are relation schema changes,
a significant change in the collected statistics, or when the cache is full and
a previously compiled plan has been aged out of the cache.
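The caching-and-invalidation behavior described above can be sketched in a few lines. This is a minimal illustration, not any particular DBMS's implementation; the class and parameter names (PlanCache, stats_drift_threshold) are invented for the example:

```python
# Sketch of compile-time plan caching (hypothetical API): a plan is
# compiled once per query text and reused until the relation schema
# version changes or the collected statistics drift significantly.

class PlanCache:
    def __init__(self, stats_drift_threshold=0.5):
        self.cache = {}                  # query text -> (plan, schema_ver, stats_ver)
        self.threshold = stats_drift_threshold

    def get_plan(self, sql, schema_ver, stats_ver, optimize):
        entry = self.cache.get(sql)
        if entry is not None:
            plan, cached_schema, cached_stats = entry
            drift = abs(stats_ver - cached_stats) / max(cached_stats, 1)
            if cached_schema == schema_ver and drift < self.threshold:
                return plan              # reuse the cached plan
        plan = optimize(sql)             # recompile the query
        self.cache[sql] = (plan, schema_ver, stats_ver)
        return plan

calls = []
cache = PlanCache()
plan1 = cache.get_plan("SELECT ...", 1, 100, lambda q: calls.append(q) or "plan-A")
plan2 = cache.get_plan("SELECT ...", 1, 110, lambda q: calls.append(q) or "plan-B")
# small statistics drift: the cached plan is reused and the optimizer
# is only invoked once
```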
The collected statistics include relation
statistics, such as the number of records, number of blocks, and number of
distinct values in the relation. For each attribute in the relation, statistics
are kept on the attribute’s selectivity, which is the average number of records
that will satisfy an equality selection condition on that attribute. These statistics are
estimated over a small subset of the entire relation because collecting statistics
on the entire relation could be extremely time consuming.
Using these statistics, the probabilities of
relation characteristics can be determined. For example, if a given attribute
is a relation’s only primary key, the DBMS knows that each tuple selected using
that attribute will be unique. In other situations, the number of distinct
values in the analyzed subset is extrapolated to the entire relation, which
yields the likelihood that a selection on the specified attribute produces a unique tuple.
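A rough illustration of this sample-based estimation follows. The function name and the estimator itself are invented for the example and deliberately naive: when the sample contains fewer distinct values than sampled rows, the attribute is treated as low-cardinality; when every sampled value is unique, the attribute is treated as a key.

```python
import random

def estimate_selectivity(relation, attr, sample_size=100, seed=0):
    # estimate the fraction of records matching an equality predicate
    # from a small random sample, rather than scanning the relation
    rng = random.Random(seed)
    sample = rng.sample(relation, min(sample_size, len(relation)))
    d = len({row[attr] for row in sample})
    if d < len(sample):
        est_distinct = d                 # sample saturated: few distinct values
    else:
        est_distinct = len(relation)     # all sampled values unique: likely a key
    return 1.0 / est_distinct

rows = [{"dept": i % 10} for i in range(10_000)]
sel = estimate_selectivity(rows, "dept")   # close to 1/10
```

For a primary-key attribute the same sketch returns 1 divided by the relation size, matching the intuition that each equality selection yields one tuple.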
Analyzing and estimating query plans in the probabilistic manner described
above has several inherent flaws. At compilation-time, the statistics necessary
to compute an optimal query plan may be unavailable or of poor quality.
Additionally, the performance of any given plan may differ as available memory
fluctuates, producing suboptimal results at different times in different
situations depending on the server’s load. In distributed queries, network
delays, node inefficiencies, and the potentially heterogeneous nature of the
data make for suboptimal execution plans.
In some cases, it is not possible at compile time to gather accurate statistics
about the data sources; in others, potentially outdated statistics no longer
reflect actual data patterns. Adaptive query processing therefore collects at
run-time the statistics that were unavailable or insufficient at compile time,
ensuring their accuracy.
Memory fluctuations refer to the need to adapt to
memory shortages and to the availability of excess memory. Memory shortages can
happen due to competition from higher-priority transactions, for example. In
that case, running query plans may be forced to release some or all of the
resources they hold.
Network delays arise from the differing I/O rates of different nodes at
different times, depending on network traffic. Adaptive queries target three
types of network delay: initial delay (a long wait for the first tuple), slow
delivery (a constant but slow rate of tuples), and bursty arrival (fluctuation
between many tuples and none). Generally, the response
times of remote data sources are unpredictable. A node in the network may be
idle, unburdened, etc. at compilation-time but busy at run-time, and the network
pipe may be free at compilation-time but full at run-time. These
inconsistencies make reliable optimal execution plans difficult to determine.
It would also be advantageous to display partial,
incremental results whenever possible, so that the user issuing the query can
see at least some results as soon as possible.
In distributed queries, the cost of network data
transfer dominates the cost of the execution plan. Consider the following
algebraic query [Elmasri, Navathe, 2003]:
π_{fname, lname, dname} (EMPLOYEE ⋈_{dno=dnumber} DEPARTMENT)
Also consider the following factors: the EMPLOYEE relation resides at site 1
and contains 10,000 records of 100 bytes each (1,000,000 bytes); the DEPARTMENT
relation resides at site 2 and contains 100 records of 35 bytes each (3,500
bytes); the query is issued at a third result site; and the join produces
10,000 result records of 40 bytes each (400,000 bytes).
With these parameters, three different strategies to perform the join can be
analyzed: (1) transfer both EMPLOYEE and DEPARTMENT to the result site and
perform the join there (1,000,000 + 3,500 = 1,003,500 bytes transferred); (2)
transfer EMPLOYEE to site 2, join there, and ship the result to the result site
(1,000,000 + 400,000 = 1,400,000 bytes); or (3) transfer DEPARTMENT to site 1,
join there, and ship the result to the result site (3,500 + 400,000 = 403,500
bytes).
Clearly, the third option requires the
least amount of data to be transferred and will be at least 2x faster than the
next best option. However, as the rate of data transfer changes between nodes
due to network traffic, the optimality of each of these options changes. As
such, it is clearly next-to-impossible to compute an optimal query plan at
compile-time.
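The transfer-cost comparison can be reproduced directly. The record counts and sizes below are the figures from the Elmasri/Navathe textbook example cited above (assumed here); the strategy labels are descriptive only:

```python
# Bytes transferred by each join strategy in the distributed example:
# EMPLOYEE: 10,000 records x 100 bytes, DEPARTMENT: 100 records x 35
# bytes, join result: 10,000 records x 40 bytes.

EMP_BYTES = 10_000 * 100      # 1,000,000
DEPT_BYTES = 100 * 35         #     3,500
RESULT_BYTES = 10_000 * 40    #   400,000

strategies = {
    "1: ship both relations to the result site": EMP_BYTES + DEPT_BYTES,
    "2: ship EMPLOYEE to DEPARTMENT's site, ship result back": EMP_BYTES + RESULT_BYTES,
    "3: ship DEPARTMENT to EMPLOYEE's site, ship result back": DEPT_BYTES + RESULT_BYTES,
}

for name, cost in sorted(strategies.items(), key=lambda kv: kv[1]):
    print(f"{name}: {cost:,} bytes")
# strategy 3 moves 403,500 bytes vs. 1,003,500 for the next best option
```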
Enter adaptive query processing. Adaptive query
processing refers to a set of techniques devised to combat the inherent flaws of
traditional query processing. It is aimed at producing optimal query plans at
times that traditional plans fail. These techniques come in two flavors: proactive
operators, which are coded into an
execution plan and autonomously adapt to run-time changes, and reactive
optimizers, which re-form the plan and reorder pre-existing operators in
response to run-time conditions. For each of these
types, I will examine two implementations.
XJoin [Urhan, Franklin, 2000] is a proactive join operator designed to keep
producing results even when tuples arrive slowly or in bursts from remote sources.
To accomplish this, XJoin splits the hash tables
into memory-resident partitions and disk-resident partitions. During periods of
network congestion, tuples will be slow to arrive at the site executing the
join. XJoin will move these arriving tuples directly to the hash tables in
memory and apply the join operator. Conversely, during periods of accelerated,
bursty activity, the data flow rate will exceed the speed at which tuples can be
matched in memory. XJoin stores these tuples to the disk-resident partitions at
the rate they arrive. Later, when periods of inactivity resume, joins involving
these tuples stored to disk can be applied.
In this way, XJoin is always doing something,
regardless of the speed at which tuples are arriving. There are
memory-to-memory matches, disk-to-memory matches, and disk-to-disk
matches. To ensure that all matching tuples are produced, XJoin includes a
clean-up phase that runs after all tuples arrive. This ensures that matches for
tuples not present in the same partition at the same time are processed.
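A highly simplified sketch of this partitioning idea follows. It is illustrative only, with invented names, and omits the staged background processing and duplicate-prevention timestamps of the real operator; it shows the essential shape: probe in memory on arrival, spill overflow to disk, and run a clean-up phase at the end so no matches are lost.

```python
from collections import defaultdict

class XJoinSketch:
    """Toy symmetric hash join with a spill-to-disk clean-up phase."""

    def __init__(self, mem_limit=4):
        self.mem = {"L": defaultdict(list), "R": defaultdict(list)}
        self.disk = {"L": [], "R": []}     # stand-in for disk partitions
        self.mem_limit = mem_limit
        self.results = set()

    def insert(self, side, key, tup):
        other = "R" if side == "L" else "L"
        for match in self.mem[other].get(key, []):   # probe opposite table
            self.results.add((tup, match) if side == "L" else (match, tup))
        if sum(len(v) for v in self.mem[side].values()) < self.mem_limit:
            self.mem[side][key].append(tup)          # memory-resident
        else:
            self.disk[side].append((key, tup))       # spill: bursty arrival

    def cleanup(self):
        # after all tuples arrive, merge spilled tuples back and join,
        # covering disk-to-memory and disk-to-disk matches
        for side in ("L", "R"):
            for key, tup in self.disk[side]:
                self.mem[side][key].append(tup)
        for key, lefts in self.mem["L"].items():
            for l in lefts:
                for r in self.mem["R"].get(key, []):
                    self.results.add((l, r))
        return self.results

j = XJoinSketch(mem_limit=4)
for i in range(6):
    j.insert("L", i % 2, f"e{i}")    # left side keyed by join attribute
j.insert("R", 0, "d0")
j.insert("R", 1, "d1")
matches = j.cleanup()                # all 6 (left, right) pairs recovered
```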
Eddies [Avnur, Hellerstein, 2000] are a more
conceptual implementation of an adaptive query processing operator. The basic
idea is that an entity known as an eddy
is added to a query’s execution plan to control the various operators in a
query. As can be seen from the figure below, data from these operators
(each running as an independent thread) moves through the eddy. Because the
eddy functions as a central unit between each operator, it can adaptively
choose the best order to route tuples and run each successive operator.
The eddy functions by maintaining a priority queue
of all tuples needing processing. As tuples are moved from one operator to the
next, their priority level increases. This ensures that tuples at later stages
of processing are processed first. Additionally, a tuple’s priority is adjusted
based on the consumption and production rates of the operators it needs to be
processed by. Because a low-cost operator can consume tuples more quickly than
a high-cost one, it represents a smaller percentage of the total processing
time required and can be given more tuples earlier. The eddy “learns” the
operators’ relative performance by tracking the rates at which tuples are
routed through them.
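The routing policy can be caricatured in a few lines. This sketch uses invented names, keys the priority queue on how many operators a tuple still needs (so later-stage tuples are processed first), and greedily routes each tuple to the cheapest remaining operator; the actual eddy learns routing via a lottery-style scheme rather than fixed costs.

```python
import heapq
import itertools

def run_eddy(tuples, operators, costs):
    counter = itertools.count()          # tie-breaker for the heap
    # priority = number of operators remaining; fewer remaining wins
    heap = [(len(operators), next(counter), t, frozenset()) for t in tuples]
    heapq.heapify(heap)
    out = []
    while heap:
        _, _, t, done = heapq.heappop(heap)
        remaining = [name for name in operators if name not in done]
        if not remaining:
            out.append(t)                # tuple has passed every operator
            continue
        name = min(remaining, key=lambda n: costs[n])   # cheapest first
        if operators[name](t):           # tuple survives this predicate
            heapq.heappush(heap,
                           (len(remaining) - 1, next(counter), t, done | {name}))
    return out

ops = {"even": lambda t: t % 2 == 0, "small": lambda t: t < 10}
result = run_eddy(range(20), ops, {"even": 1.0, "small": 2.0})
# tuples surviving both operators: [0, 2, 4, 6, 8]
```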
The technique of Query Scrambling [Urhan, Franklin, Amsaleg, 1998] reactively
reschedules a query’s execution when remote data sources experience unexpected delays.
Query Scrambling works by associating a timer with
each operator that accesses data from a remote site. The timer is started when
the operator begins waiting for data and is stopped when the data arrives. If
the timer goes off before data arrives from the remote site, a new thread is launched
to process a different subsection of the query. When tuples from the delaying
source finally arrive, that subsection of the query is resumed and processing
can continue. A node in the execution plan, therefore, is in one of several
states described by the diagram below:
It is important to note that such reactive processing may increase the overall
cost of the query in terms of memory, I/O, etc., in exchange for faster
execution time.
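The timer-and-switch behavior can be simulated without real threads. The following sketch uses invented function and parameter names and assumed data-arrival times: when a source stalls past the timeout, the scheduler "scrambles" to another runnable subquery instead of idling, and the stalled subquery resumes once its data arrives.

```python
def scramble_schedule(subqueries, arrival_time, timeout=1.0):
    """subqueries: names in plan order; arrival_time: name -> readiness time."""
    clock = 0.0
    finished, blocked = [], []
    pending = list(subqueries)
    while pending or blocked:
        # resume any blocked subquery whose data has finally arrived
        for s in [s for s in blocked if arrival_time[s] <= clock]:
            blocked.remove(s)
            finished.append(s)
        if pending:
            s = pending.pop(0)
            if arrival_time[s] - clock > timeout:
                blocked.append(s)        # timer fired: scramble to next subquery
            else:
                clock = max(clock, arrival_time[s])
                finished.append(s)       # data arrived in time; process it
        elif blocked:
            clock = min(arrival_time[s] for s in blocked)  # nothing else to do
    return finished

order = scramble_schedule(["A", "B", "C"],
                          {"A": 5.0, "B": 0.5, "C": 0.5})
# A's source stalls, so B and C are processed first: ["B", "C", "A"]
```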
The set of optimizations known as Mid-query Re-optimization [Kabra, DeWitt,
1998] describes the collection of additional statistics at key points during
query execution. This technique counters the
statistics at key points during query execution. This technique counters the
fact that statistics collected by the DBMS might be outdated at the time of
query execution (i.e. many tuples might have been added to a relation that now
render the collected statistics invalid). The statistics collected by Mid-query
Re-optimization can be very specific, because the exact need and use is known
at run-time by the currently executing query.
Based on these statistics collected as the query
runs, the optimizer knows that it has accurate statistics and can improve
resource allocation or change the execution plan for the remainder of the query
if suboptimality is detected. It is of course of vital importance to carefully
determine when and how to collect the necessary statistics, and Mid-query
Re-optimization includes many heuristics to deal with this. For example, a
statistics collection routine can compute average tuple size without
interrupting the normal execution of a query by examining tuples produced by a
SELECT operator and keeping a running count of the number of matching tuples.
This can be done in the same pass over the input data that produces the
operator’s output, ensuring that the run-time statistics collection does not
slow down the query execution.
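The running-count example above might look like the sketch below. Names are invented, repr length stands in for tuple size, and the divergence test is a simple threshold rather than the heuristics of the actual system.

```python
def select_with_stats(tuples, predicate, stats):
    # stream tuples through a SELECT while collecting statistics in the
    # same pass, so the collection adds no extra scan of the input
    for t in tuples:
        if predicate(t):
            stats["count"] += 1
            stats["bytes"] += len(repr(t))   # stand-in for tuple size
            yield t

def should_reoptimize(stats, estimated_count, divergence=2.0):
    # flag the rest of the plan for re-optimization when the observed
    # cardinality diverges far from the optimizer's compile-time estimate
    return stats["count"] > divergence * estimated_count

stats = {"count": 0, "bytes": 0}
selected = list(select_with_stats(range(100), lambda t: t % 2 == 0, stats))
# 50 tuples observed against an estimate of 10: re-optimize the remainder
```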
Kabra and DeWitt’s experimental results have been
very promising. Although queries with simple joins (0-1 relations) show no
improvement in execution time, queries with joins of medium complexity (2-3
relations) show mild improvement and queries with complex joins (4+ relations)
show significant improvement.
The
techniques described in this paper represent a major step forward in the
evolution of query processing. Adaptive query processing can be very powerful
in distributed systems where its techniques are required, and this power also
shows promising results in non-distributed systems. As with other heuristic
techniques, there is a trade-off between the performance gains and the cost of
additional processing, statistics maintenance, query recompilation, and the
like. As time goes by, research will no doubt uncover additional methods that
further increase the efficiency of query processing, yielding faster query
execution time and more optimized execution plans.
(1) A. Gounaris, N. W. Paton, A. Fernandes, and R.
Sakellariou. Adaptive Query Processing: A Survey. In 19th British
National Conference on Databases, pages 11-25, 2002.
This paper presents an overview of the history and
recent developments of adaptive query processing. It includes a brief
description of several techniques, the power they provide, and their adaptive
abilities.
(2) J. M. Hellerstein, M. J. Franklin, S.
Chandrasekaran, A. Deshpande, K. Hildrum, S. Madden, V. Raman, and M. Shah.
Adaptive query processing: Technology in evolution. IEEE Data Engineering
Bulletin, 23(2):7-18, 2000.
This paper describes the Telegraph project developed at the University of
California, Berkeley.
(3) Z. Ives, A. Levy, D. Weld, D. Florescu, and M.
Friedman. Adaptive query processing for internet applications. IEEE Data
Engineering Bulletin, 23(2):19-26, 2000.
This paper describes adaptive query optimization as it relates to the benefits
and pitfalls of a wide-area system using the Internet. Specifically, it
analyzes the need to adjust processing based on I/O
delays and data flow rates, sharing and re-use of data across multiple queries,
and the ability to output partial, incremental results.
(4) G. Graefe. Dynamic Query Evaluation Plans: Some
Course Corrections? IEEE Data Engineering Bulletin, 23(2):4-7, 2000.
This paper seeks to realign the focus of adaptive
query processing to include query execution and physical database design. The
author believes that tying the three together will yield a broader
understanding of the issues involved in adaptive optimization.
(5) R. Avnur and M. Thomas. Continuously
Re-optimizing Query Processor. Technical Report.
This paper describes a query processor that does
not rely on statistics gathering, but instead adapts to the performance of the
query operators. The study focused on select operators and pipelined
index-joins and hash ripple-joins.
(6) R. Avnur and J. Hellerstein. Eddies:
Continuously adaptive query processing. Proceedings of the ACM SIGMOD, 2000,
pages 261-272, 2000.
The authors present a learning algorithm that
controls adaptations to run-time changes, including cost and selectivity of
operators, in the execution environment.
(7) N. Kabra and D. J. DeWitt. Efficient Mid-Query
Re-Optimization of Sub-Optimal Query Execution Plans. Proceedings of the ACM
SIGMOD International Conference on Management of Data, Vol. 27, pages
106-117, ACM Press, NY, 1998.
The authors describe a system that detects
sub-optimal query plans and attempts to correct them during execution. In this
way, statistics collected at key points during execution are used to optimize;
careful consideration of when and how to collect these statistics is necessary.
(8) K. Ng, Z. Wang, R. Muntz, and S. Nittel.
Dynamic query re-optimization. Proceedings of 11th International
Conference on Statistical and Scientific Database Management, pages
264-273, IEEE Computer Society, 1999.
This paper analyzes the situation in which system
configuration and resource availability change during the execution of long
running queries. Such queries are therefore executed using sub-optimal plans;
this paper describes an approach to re-optimize while considering
reconfiguration cost as well as execution cost.
(9) K. Ng, Z. Wang, and R. Muntz. Dynamic reconfiguration of sub-optimal parallel query
execution plans. Technical Report, UCLA, 1998.
The authors propose an algorithm to coordinate the
steps in a dynamic reconfiguration of query execution plans. Additionally, the
authors suggest a syntactic extension to SQL to allow users to specify if and
how query plans should be reconfigured for each query.
(10) L. Bouganim, F. Fabret, C. Mohan, and P.
Valduriez. Dynamic query scheduling in data integration systems. Proceedings
of ICDE 2000, pages 425-434, 2000.
This paper focuses on the problem of an
unpredictable data rate in coordination with query scheduling. The approach
performs a step-by-step scheduling of several query fragments, which are then
processed based on data arrival. This yields significant performance gains.
(11) T. Urhan and M. Franklin. XJoin: A
reactively-scheduled pipelined join operator. IEEE Data Engineering Bulletin, 23(2):27-33, 2000.
This paper describes the design of XJoin, an
operator capable of providing responsive query processing when accessing data
from widely distributed sources. XJoin functions in a pipelined fashion, so
that several concurrent tasks are each executing at different stages as
appropriate.
(12) S. R. Madden, M. A. Shah, J. M. Hellerstein,
and V. Raman. Continuously Adaptive Continuous Queries over Streams. Proceedings of the ACM SIGMOD International
Conference on Management of Data, 2002.
The authors describe adaptive solutions to
continuous queries – queries that run over an infinite, always changing data
source. Because of their nature, such unbounded queries will run long enough to
experience changes in system and data properties, so adaptive mechanisms are
necessary for optimal query execution.
(13) M. Shah, J. Hellerstein, S. Chandrasekaran,
and M. Franklin. Flux: An Adaptive Partitioning Operator for Continuous Query
Systems. Technical Report.
This paper introduces a dataflow operator called
Flux (Fault-tolerant Load-balancing eXchange), which repartitions lookup-based operators
while the pipeline is still executing. The mechanism provides great
improvements in throughput and average latency for continuous queries.
(14) Z. Ives, D. Florescu, M. Friedman, A. Levy,
and D. S. Weld. An Adaptive Query Execution System for Data Integration. ACM
SIGMOD Conference, 1999.
This paper presents the Tukwila data integration
system, which uses adaptive-execution features to improve performance. The
system was designed to overcome three inherent problems with traditional
database query execution techniques when executing queries on remote sources:
the absence of quality statistics, unpredictable and bursty data arrival rates,
and frequent redundancy of the data sources.
(15) V. Zadorozhny, M.E. Vidal, L. Raschid, T.
Urhan, and L. Bright. Efficient
evaluation of queries in a mediator for websources. Under review, 1999.
This paper describes a two-phase Web Query
Optimizer that efficiently searches a large space of plans at run-time to
obtain a low cost plan (in comparison to a traditional optimizer).
(16) T. Urhan and M. J. Franklin. XJoin: Getting
fast answers from slow and bursty networks. Technical Report, University of Maryland, 1999.
This paper is the initial technical report on
XJoin, and it describes much of the background in the authors’ search for
techniques to allow querying through heterogeneous and semi-structured
databases from sources distributed around the world.
(17) J. Forrester and J. Ledlie. XJoin and the
Benefits of Free Work. Technical Report.
The authors report on their experience implementing
the XJoin algorithm. Their implementation performed well, and the paper
includes the source code they created.
(18) A.Y. Levy. Combining artificial intelligence
and databases for data integration. Technical Report.
This paper describes several adaptive AI techniques
to solve problems of integrating data from a multitude of independent,
autonomous data sources.
(19) L. Amsaleg, M. J. Franklin, and A. Tomasic. Dynamic Query Operator Scheduling for
Wide-Area Remote Access. Journal of Distributed and Parallel Databases,
Vol. 6, No. 3, 1998.
The authors describe a technique they devised
called Query Scrambling, which dynamically modifies query execution plans
on-the-fly in reaction to unexpected delays in data access (when accessing
remote sources). They focus on the dynamic scheduling of query operators and
simulate the effects of the scheduler.
(20) H. Paques, L. Liu, and C. Pu. Distributed
Query Adaptation and Its Trade-offs. Technical Report, Georgia Institute of
Technology, 2002.
The authors describe an overview of Ginga, an adaptive
query processing engine that combines proactive (compile-time) alternative
query plan generation with reactive (run-time) monitoring of network delays.
(21) H. Paques, L. Liu, and C. Pu. Ginga: A
Self-Adaptive Query Processing System. Technical Report, Georgia Institute of
Technology, 2003.
The authors describe Ginga in greater detail,
including its architecture and the various phases the system encompasses
throughout its life cycle.
(22) C. Shahabi, L. Khan, D. Mcleod and V. Shah.
Run-Time Optimization of Join Queries for Distributed Databases over the
Internet. Proceedings of Communication
Networks and Distributed Systems Modeling and Simulation, 1999.
The authors devise a run-time, probing mechanism
for use in JOIN queries that produces optimal query execution plans in
distributed systems.
(23) S. Chaudhuri. An Overview of Query
Optimization in Relational Systems. Proceedings
of the ACM PODS, pages 34-43, 1998.
This paper provides an overview of the general issues in query optimization.
(24) L. Amsaleg, M. Franklin, and A. Tomasic. Query
Scrambling for Bursty Data Arrival. Technical Report.
The Query Scrambling technique is described in
detail.
(25) R. Elmasri and S. Navathe. Fundamentals of
Database Systems, Fourth Edition, 2003.
Our course textbook, which provides an overview of all functions within a DBMS.