Chapter 13: Query Processing
- Overview
- Measures of Query Cost
- Selection Operation
- Sorting
- Join Operation
- Other Operations
- Evaluation of Expressions
Basic Steps in Query Processing
1. Parsing and translation
2. Optimization
3. Evaluation
- Parsing and translation
  - Translate the query into its internal form, which is then translated into relational algebra.
  - The parser checks syntax and verifies that the referenced relations exist.
- Evaluation
  - The query-execution engine takes a query-evaluation plan, executes that plan, and returns the answers to the query.
Basic Steps in Query Processing: Optimization
- A relational-algebra expression may have many equivalent expressions
  - E.g., σbalance<2500(Πbalance(account)) is equivalent to Πbalance(σbalance<2500(account))
- Each relational-algebra operation can be evaluated using one of several different algorithms
  - Correspondingly, a relational-algebra expression can be evaluated in many ways.
- An annotated expression specifying a detailed evaluation strategy is called an evaluation plan.
  - E.g., we can use an index on balance to find accounts with balance < 2500,
  - or we can perform a complete relation scan and discard accounts with balance ≥ 2500
- Query optimization: among all equivalent evaluation plans, choose the one with lowest cost.
  - Cost is estimated using statistical information from the database catalog
    - e.g., number of tuples in each relation, size of tuples, etc.
- In this chapter we study
  - How to measure query costs
  - Algorithms for evaluating relational-algebra operations
  - How to combine algorithms for individual operations in order to evaluate a complete expression
- In Chapter 14
  - We study how to optimize queries, that is, how to find an evaluation plan with the lowest estimated cost
Measures of Query Cost
- Cost is generally measured as total elapsed time for answering a query
  - Many factors contribute to time cost
    - disk accesses, CPU, or even network communication
- Typically disk access is the predominant cost, and is also relatively easy to estimate. It is measured by taking into account
  - Number of seeks * average-seek-cost
  - Number of blocks read * average-block-read-cost
  - Number of blocks written * average-block-write-cost
    - Cost to write a block is greater than cost to read a block
      - data is read back after being written to ensure that the write was successful
- For simplicity we just use the number of block transfers from disk and the number of seeks as the cost measures
  - tT – time to transfer one block
  - tS – time for one seek
  - Cost for b block transfers plus S seeks: b * tT + S * tS
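A minimal sketch of this cost model in Python; the timing constants are illustrative assumptions, not values from the text:

    # Disk-cost model: b block transfers plus S seeks.
    T_TRANSFER = 0.0001   # tT: seconds to transfer one block (assumed value)
    T_SEEK = 0.004        # tS: seconds for one seek (assumed value)

    def disk_cost(block_transfers: int, seeks: int) -> float:
        """Estimated cost = b * tT + S * tS (CPU and output-write costs ignored)."""
        return block_transfers * T_TRANSFER + seeks * T_SEEK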
- We ignore CPU costs for simplicity
  - Real systems do take CPU cost into account
- We do not include the cost of writing output to disk in our cost formulae
- Several algorithms can reduce disk I/O by using extra buffer space
  - The amount of real memory available for buffering depends on other concurrent queries and OS processes, and is known only during execution
    - We often use worst-case estimates, assuming only the minimum amount of memory needed for the operation is available
- Required data may already be buffer resident, avoiding disk I/O
  - But this is hard to take into account for cost estimation
Selection Operation
- File scan – search algorithms that locate and retrieve records that fulfill a selection condition.
- Algorithm A1 (linear search). Scan each file block and test all records to see whether they satisfy the selection condition.
  - Cost estimate = br block transfers + 1 seek
    - br denotes the number of blocks containing records of relation r
  - If the selection is on a key attribute, we can stop on finding the record
    - cost = (br / 2) block transfers + 1 seek
  - Linear search can be applied regardless of
    - selection condition, or
    - ordering of records in the file, or
    - availability of indices
- A2 (binary search). Applicable if the selection is an equality comparison on the attribute on which the file is ordered.
  - Assume that the blocks of a relation are stored contiguously
  - Cost estimate (number of disk blocks to be scanned):
    - cost of locating the first tuple by a binary search on the blocks
      - ⌈log2(br)⌉ * (tT + tS)
    - If there are multiple records satisfying the selection
      - Add the transfer cost of the number of blocks containing records that satisfy the selection condition
      - We will see how to estimate this cost in Chapter 14
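A small sketch of the A1 and A2 cost estimates under this model; the function and parameter names are hypothetical:

    import math

    def linear_search_cost(br, tT, tS, equality_on_key=False):
        """A1: scan the whole file, or on average half of it for an equality on a key."""
        transfers = br / 2 if equality_on_key else br
        return transfers * tT + 1 * tS

    def binary_search_cost(br, tT, tS, matching_blocks=1):
        """A2: binary search over contiguously stored blocks, plus extra transfers
        for any additional blocks holding matching records."""
        locate_first = math.ceil(math.log2(br)) * (tT + tS)
        return locate_first + (matching_blocks - 1) * tT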
Selections Using Indices
- Index scan – search algorithms that use an index
  - the selection condition must be on the search key of the index.
- A3 (primary index on candidate key, equality). Retrieve a single record that satisfies the corresponding equality condition
  - Cost = (hi + 1) * (tT + tS), where hi is the height of the index
- A4 (primary index on nonkey, equality). Retrieve multiple records.
  - Records will be on consecutive blocks
    - Let b = number of blocks containing matching records
  - Cost = hi * (tT + tS) + tS + tT * b
- A5 (equality on search key of secondary index).
  - Retrieve a single record if the search key is a candidate key
    - Cost = (hi + 1) * (tT + tS)
  - Retrieve multiple records if the search key is not a candidate key
    - each of n matching records may be on a different block
    - Cost = (hi + n) * (tT + tS)
      - Can be very expensive!
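The same formulas as a code sketch (helper names are hypothetical; hi is the index height):

    def a3_primary_index_key(hi, tT, tS):
        """A3: primary index, equality on a candidate key -- one record."""
        return (hi + 1) * (tT + tS)

    def a4_primary_index_nonkey(hi, b, tT, tS):
        """A4: primary index, equality on a nonkey -- b consecutive blocks of matches."""
        return hi * (tT + tS) + tS + tT * b

    def a5_secondary_index(hi, n, tT, tS):
        """A5: secondary index, equality on a nonkey -- n matches, possibly one block each."""
        return (hi + n) * (tT + tS)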
Selections Involving Comparisons
- Can implement selections of the form σA≤V(r) or σA≥V(r) by using
  - a linear file scan or binary search,
  - or by using indices in the following ways:
- A6 (primary index, comparison). (Relation is sorted on A)
  - For σA≥V(r), use the index to find the first tuple ≥ v and scan the relation sequentially from there
  - For σA≤V(r), just scan the relation sequentially till the first tuple > v; do not use the index
- A7 (secondary index, comparison).
  - For σA≥V(r), use the index to find the first index entry ≥ v and scan the index sequentially from there, to find pointers to records.
  - For σA≤V(r), just scan the leaf pages of the index finding pointers to records, till the first entry > v
  - In either case, retrieve the records that are pointed to
    - requires an I/O for each record
    - a linear file scan may be cheaper
Implementation of Complex Selections
- Conjunction: σθ1∧θ2∧...∧θn(r)
- A8 (conjunctive selection using one index).
  - Select a combination of θi and algorithms A1 through A7 that results in the least cost for σθi(r).
  - Test other conditions on the tuple after fetching it into the memory buffer.
- A9 (conjunctive selection using multiple-key index).
  - Use an appropriate composite (multiple-key) index if available.
- A10 (conjunctive selection by intersection of identifiers).
  - Requires indices with record pointers.
  - Use the corresponding index for each condition, and take the intersection of all the obtained sets of record pointers.
  - Then fetch the records from the file
  - If some conditions do not have appropriate indices, apply the test in memory.
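A minimal sketch of the A10 approach; the index structures and helper names are assumptions for illustration:

    def conjunctive_selection_by_intersection(rid_sets, fetch_record, remaining_tests):
        """Intersect the record-pointer (rid) sets obtained from the indexed conditions,
        fetch only those records, then apply any non-indexed conditions in memory."""
        rids = set.intersection(*rid_sets) if rid_sets else set()
        for rid in rids:
            record = fetch_record(rid)            # one fetch per surviving rid
            if all(test(record) for test in remaining_tests):
                yield record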
Algorithms for Complex Selections
- Disjunction: σθ1∨θ2∨...∨θn(r)
- A11 (disjunctive selection by union of identifiers).
  - Applicable if all conditions have available indices.
    - Otherwise use a linear scan.
  - Use the corresponding index for each condition, and take the union of all the obtained sets of record pointers.
  - Then fetch the records from the file
- Negation: σ¬θ(r)
  - Use a linear scan on the file
  - If very few records satisfy ¬θ, and an index is applicable to ¬θ
    - Find the satisfying records using the index and fetch them from the file
Sorting
- We may build an index on the relation, and then use the index to read the relation in sorted order. This may lead to one disk block access for each tuple.
- For relations that fit in memory, techniques like quicksort can be used. For relations that don't fit in memory, external sort-merge is a good choice.
External Sort-Merge
- Create sorted runs. Let i be 0 initially. (M denotes the number of blocks that fit in main memory.)
  Repeatedly do the following till the end of the relation:
  (a) Read M blocks of the relation into memory
  (b) Sort the in-memory blocks
  (c) Write the sorted data to run Ri; increment i.
  Let the final value of i be N.
- Merge the runs (next slide).
- Merge the runs (N-way merge). We assume (for now) that N < M.
- Use N blocks of memory to buffer input runs, and 1 block to buffer output. Read the first block of each run into its buffer page.
- repeat
  - Select the first record (in sort order) among all buffer pages
  - Write the record to the output buffer. If the output buffer is full, write it to disk.
  - Delete the record from its input buffer page.
    If the buffer page becomes empty, read the next block (if any) of the run into the buffer.
- until all input buffer pages are empty
- If N ≥ M, several merge passes are required.
  - In each pass, contiguous groups of M - 1 runs are merged.
  - A pass reduces the number of runs by a factor of M - 1, and creates runs longer by the same factor.
    - E.g., if M = 11 and there are 90 runs, one pass reduces the number of runs to 9, each 10 times the size of the initial runs
  - Repeated passes are performed till all runs have been merged into one.
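A compact sketch of the two phases; for brevity the runs are kept as in-memory lists here (a real implementation writes each run to disk and streams it back during the merge):

    import heapq

    def create_sorted_runs(records, run_size):
        """Phase 1: fill memory (run_size records stands in for M blocks of tuples),
        sort it, and emit it as a run."""
        runs, buffer = [], []
        for rec in records:
            buffer.append(rec)
            if len(buffer) == run_size:
                runs.append(sorted(buffer))
                buffer = []
        if buffer:
            runs.append(sorted(buffer))
        return runs

    def merge_runs(runs):
        """Phase 2: N-way merge -- repeatedly emit the smallest record among
        the fronts of all runs (heapq.merge does exactly this)."""
        return heapq.merge(*runs)

    # Usage: sorted_stream = merge_runs(create_sorted_runs(data, run_size=1000))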
External Sort-Merge: Cost Analysis
- Cost analysis:
  - Total number of merge passes required: ⌈logM–1(br / M)⌉
  - Block transfers for initial run creation as well as in each pass is 2br
    - for the final pass, we don't count the write cost
      - we ignore the final write cost for all operations, since the output of an operation may be sent to the parent operation without being written to disk
    - Thus the total number of block transfers for external sorting is
      br (2 ⌈logM–1(br / M)⌉ + 1)
  - Seeks: next slide
- Cost of seeks
  - During run generation: one seek to read each run and one seek to write each run
    - 2 ⌈br / M⌉
  - During the merge phase
    - Buffer size: bb (read/write bb blocks at a time)
    - Need 2 ⌈br / bb⌉ seeks for each merge pass
      - except the final one, which does not require a write
    - Total number of seeks:
      2 ⌈br / M⌉ + ⌈br / bb⌉ (2 ⌈logM–1(br / M)⌉ – 1)
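The same formulas as a small sketch (variable names mirror the text; bb is the merge buffer size in blocks):

    import math

    def external_sort_cost(br, M, bb):
        """Block transfers and seeks for external sort-merge (final write not counted)."""
        passes = math.ceil(math.log(br / M, M - 1))
        transfers = br * (2 * passes + 1)
        seeks = 2 * math.ceil(br / M) + math.ceil(br / bb) * (2 * passes - 1)
        return transfers, seeks

    # Example: br = 90, M = 11, bb = 1 gives one merge pass,
    # 270 block transfers and 2*9 + 90*1 = 108 seeks.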
Join Operation
- Several different algorithms to implement joins
  - Nested-loop join
  - Block nested-loop join
  - Indexed nested-loop join
  - Merge-join
  - Hash-join
- Choice is based on cost estimates
- Examples use the following information
  - Number of records: customer 10,000; depositor 5,000
  - Number of blocks: customer 400; depositor 100
Nested-Loop Join
- To compute the theta join r ⋈θ s:
      for each tuple tr in r do begin
          for each tuple ts in s do begin
              test pair (tr, ts) to see if they satisfy the join condition θ
              if they do, add tr • ts to the result.
          end
      end
- r is called the outer relation and s the inner relation of the join.
- Requires no indices and can be used with any kind of join condition.
- Expensive, since it examines every pair of tuples in the two relations.
- In the worst case, if there is enough memory only to hold one block of each relation, the estimated cost (sketched in code below) is
      nr * bs + br block transfers, plus nr + br seeks
- If the smaller relation fits entirely in memory, use that as the inner relation.
  - Reduces cost to br + bs block transfers and 2 seeks
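A small helper expressing the worst-case formula (the function name is hypothetical; the numbers in the comments match the example that follows):

    def nested_loop_join_cost(nr, br, bs):
        """Worst case (one memory block per relation), with r as the outer relation:
        nr * bs + br block transfers and nr + br seeks."""
        return nr * bs + br, nr + br

    # nested_loop_join_cost(5000, 100, 400)   -> (2000100, 5100)   depositor as outer
    # nested_loop_join_cost(10000, 400, 100)  -> (1000400, 10400)  customer as outer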
- Assuming worst-case memory availability, the cost estimate is
  - with depositor as the outer relation:
    - 5000 * 400 + 100 = 2,000,100 block transfers,
    - 5000 + 100 = 5100 seeks
  - with customer as the outer relation:
    - 10000 * 100 + 400 = 1,000,400 block transfers and 10,400 seeks
- If the smaller relation (depositor) fits entirely in memory, the cost estimate will be 500 block transfers.
- The block nested-loop algorithm (next slide) is preferable.
Block Nested-Loop Join
- Variant of nested-loop join in which every block of the inner relation is paired with every block of the outer relation.
      for each block Br of r do begin
          for each block Bs of s do begin
              for each tuple tr in Br do begin
                  for each tuple ts in Bs do begin
                      check if (tr, ts) satisfy the join condition
                      if they do, add tr • ts to the result.
                  end
              end
          end
      end
- Worst-case estimate: br * bs + br block transfers + 2 * br seeks
  - Each block in the inner relation s is read once for each block in the outer relation (instead of once for each tuple in the outer relation)
- Best case: br + bs block transfers + 2 seeks.
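A minimal runnable version of the pseudocode above, with Python lists of tuples standing in for disk blocks (block formation and buffering are simplified assumptions):

    def block_nested_loop_join(r_blocks, s_blocks, theta):
        """Pair every block of the inner relation s with every block of the outer
        relation r, so each inner block is read once per outer block rather than
        once per outer tuple."""
        for Br in r_blocks:            # outer relation, one block at a time
            for Bs in s_blocks:        # inner relation, one block at a time
                for tr in Br:
                    for ts in Bs:
                        if theta(tr, ts):
                            yield tr + ts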
- Improvements to nested-loop and block nested-loop algorithms:
  - In block nested-loop, use M - 2 disk blocks as the blocking unit for the outer relation, where M = memory size in blocks; use the remaining two blocks to buffer the inner relation and the output
    - Cost = ⌈br / (M-2)⌉ * bs + br block transfers + 2 ⌈br / (M-2)⌉ seeks
  - If the equi-join attribute forms a key of the inner relation, stop the inner loop on the first match
  - Scan the inner loop forward and backward alternately, to make use of the blocks remaining in the buffer (with LRU replacement)
  - Use an index on the inner relation if available (next slide)
Indexed Nested-Loop Join
- Index lookups can replace file scans if
  - the join is an equi-join or natural join, and
  - an index is available on the inner relation's join attribute
    - Can construct an index just to compute a join.
- For each tuple tr in the outer relation r, use the index to look up tuples in s that satisfy the join condition with tuple tr.
- Worst case: the buffer has space for only one page of r, and, for each tuple in r, we perform an index lookup on s.
- Cost of the join: br (tT + tS) + nr * c
  - where c is the cost of traversing the index and fetching all matching s tuples for one tuple of r
  - c can be estimated as the cost of a single selection on s using the join condition.
- If indices are available on the join attributes of both r and s, use the relation with fewer tuples as the outer relation.
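A sketch of the idea with a Python dict standing in for the index on s's join attribute (a real system would probe an existing B+-tree or hash index rather than build a map):

    from collections import defaultdict

    def indexed_nested_loop_join(r, s, r_key, s_key):
        """For each tuple of the outer relation r, probe an index on the inner
        relation s instead of scanning s."""
        index = defaultdict(list)
        for ts in s:
            index[s_key(ts)].append(ts)
        for tr in r:
            for ts in index.get(r_key(tr), []):   # index lookup replaces the inner scan
                yield tr + ts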
Example of Nested-Loop Join Costs
- Compute depositor ⋈ customer, with depositor as the outer relation.
- Let customer have a primary B+-tree index on the join attribute customer-name, which contains 20 entries in each index node.
- Since customer has 10,000 tuples, the height of the tree is 4, and one more access is needed to find the actual data.
- depositor has 5000 tuples.
- Cost of block nested-loop join
  - 400 * 100 + 100 = 40,100 block transfers + 2 * 100 = 200 seeks
    - assuming worst-case memory
    - may be significantly less with more memory
- Cost of indexed nested-loop join
  - 100 + 5000 * 5 = 25,100 block transfers and seeks
  - CPU cost is likely to be less than that of block nested-loop join
Merge-Join
- Sort both relations on their join attribute (if not already sorted on the join attributes).
- Merge the sorted relations to join them
- The join step is similar to the merge stage of the sort-merge algorithm.
- The main difference is the handling of duplicate values in the join attribute: every pair with the same value on the join attribute must be matched
- Detailed algorithm in the book (a sketch follows below)
- Can be used only for equi-joins and natural joins
- Each block needs to be read only once (assuming all tuples for any given value of the join attributes fit in memory)
- Thus the cost of merge-join is:
      br + bs block transfers + ⌈br / bb⌉ + ⌈bs / bb⌉ seeks
  - plus the cost of sorting if the relations are unsorted.
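The duplicate-handling logic as a minimal in-memory sketch; it assumes both inputs are already sorted Python lists of tuples, and the helper names are illustrative:

    def merge_join(r, s, r_key, s_key):
        """Merge two relations sorted on their join attributes; every r tuple in a
        run of equal keys is paired with every matching s tuple."""
        i, j = 0, 0
        while i < len(r) and j < len(s):
            if r_key(r[i]) < s_key(s[j]):
                i += 1
            elif r_key(r[i]) > s_key(s[j]):
                j += 1
            else:
                key = r_key(r[i])
                j_start = j
                while j < len(s) and s_key(s[j]) == key:
                    j += 1
                s_run = s[j_start:j]              # assumed to fit in memory
                while i < len(r) and r_key(r[i]) == key:
                    for ts in s_run:
                        yield r[i] + ts
                    i += 1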
- Hybrid merge-join: applicable if one relation is sorted and the other has a secondary B+-tree index on the join attribute
  - Merge the sorted relation with the leaf entries of the B+-tree.
  - Sort the result on the addresses of the unsorted relation's tuples
  - Scan the unsorted relation in physical address order and merge with the previous result, to replace addresses by the actual tuples
    - Sequential scan is more efficient than random lookup
Hash-Join
- Applicable for equi-joins and natural joins.
- A hash function h is used to partition the tuples of both relations
- h maps JoinAttrs values to {0, 1, ..., n}, where JoinAttrs denotes the common attributes of r and s used in the natural join.
  - r0, r1, ..., rn denote partitions of r tuples
    - Each tuple tr ∈ r is put in partition ri, where i = h(tr[JoinAttrs]).
  - s0, s1, ..., sn denote partitions of s tuples
    - Each tuple ts ∈ s is put in partition si, where i = h(ts[JoinAttrs]).
- Note: in the book, ri is denoted as Hri, si is denoted as Hsi, and n is denoted as nh.
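A minimal sketch of the partitioning step (the build and probe phases are omitted; the function name and the use of Python's built-in hash are illustrative assumptions):

    def partition(relation, join_key, n):
        """Split tuples into partitions 0..n by hashing the join-attribute value.
        Tuples of r and s that can join always land in partitions with the same index."""
        parts = [[] for _ in range(n + 1)]
        for t in relation:
            parts[hash(join_key(t)) % (n + 1)].append(t)
        return parts

    # r_parts = partition(r, r_key, n); s_parts = partition(s, s_key, n)
    # Each pair (r_parts[i], s_parts[i]) can then be joined independently,
    # e.g. by building an in-memory hash index on the smaller partition.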
