Don’t get fooled by the Statistic Estimations

Biased-value histogram entries store exact cardinalities. Up-to-date statistics should therefore reflect exactly the number of rows returned in the retrieve step for a biased value.

In other words: the Optimizer can determine the number of rows being retrieved simply by looking up the biased value in the histogram.

Unfortunately, this is not always the case.

I created a test scenario, populating a test table with 50,000 rows containing 100 distinct dates:

CREATE MULTISET TABLE Statistics_Estimation1
(
TheDate DATE FORMAT 'YYYY-MM-DD' NOT NULL
) PRIMARY INDEX (TheDate);
INSERT INTO Statistics_Estimation1
SELECT DATE '2010-01-01' + (ROW_NUMBER() OVER (ORDER BY 1) MOD 100) AS TheDate
FROM SYS_CALENDAR.CALENDAR
SAMPLE 50000;

To have fresh statistics available, I executed the statement below. The system can pack all 50,000 rows into 100 histogram intervals, each of them stored as a biased value.

This is important for our test setup, as it ensures that the estimated number of rows should always match the retrieved number of rows exactly:

COLLECT STATS ON Statistics_Estimation1 COLUMN(TheDate);

We can now inspect the histograms by issuing the statement below:

SHOW STATISTICS VALUES ON Statistics_Estimation1;

/** Biased: Value, Frequency **/
/* 1 */ DATE ‘2010-01-01’, 515,
/* 2 */ DATE ‘2010-01-02’, 498,
/* 3 */ DATE ‘2010-01-03’, 510,
/* 4 */ DATE ‘2010-01-04’, 482,
/* 5 */ DATE ‘2010-01-05’, 490,
/* 6 */ DATE ‘2010-01-06’, 518,
/* 7 */ DATE ‘2010-01-07’, 510,
/* 8 */ DATE ‘2010-01-08’, 512,
/* 9 */ DATE ‘2010-01-09’, 493,
/* 10 */ DATE ‘2010-01-10’, 513,
/* 11 */ DATE ‘2010-01-11’, 498,
/* 12 */ DATE ‘2010-01-12’, 513,
/* 13 */ DATE ‘2010-01-13’, 509,
/* 14 */ DATE ‘2010-01-14’, 500,
/* 15 */ DATE ‘2010-01-15’, 491,
/* 16 */ DATE ‘2010-01-16’, 489,
/* 17 */ DATE ‘2010-01-17’, 481,
/* 18 */ DATE ‘2010-01-18’, 508,
/* 19 */ DATE ‘2010-01-19’, 496,
/* 20 */ DATE ‘2010-01-20’, 505,
/* 21 */ DATE ‘2010-01-21’, 498,
/* 22 */ DATE ‘2010-01-22’, 496,
/* 23 */ DATE ‘2010-01-23’, 505,
/* 24 */ DATE ‘2010-01-24’, 504,
/* 25 */ DATE ‘2010-01-25’, 485,
/* 26 */ DATE ‘2010-01-26’, 508,
/* 27 */ DATE ‘2010-01-27’, 494,
/* 28 */ DATE ‘2010-01-28’, 507,
/* 29 */ DATE ‘2010-01-29’, 495,
/* 30 */ DATE ‘2010-01-30’, 501,
/* 31 */ DATE ‘2010-01-31’, 486,
/* 32 */ DATE ‘2010-02-01’, 517,
/* 33 */ DATE ‘2010-02-02’, 501,
/* 34 */ DATE ‘2010-02-03’, 505,
/* 35 */ DATE ‘2010-02-04’, 496,
/* 36 */ DATE ‘2010-02-05’, 514,
/* 37 */ DATE ‘2010-02-06’, 492,
/* 38 */ DATE ‘2010-02-07’, 496,
/* 39 */ DATE ‘2010-02-08’, 504,
/* 40 */ DATE ‘2010-02-09’, 486,
/* 41 */ DATE ‘2010-02-10’, 509,
/* 42 */ DATE ‘2010-02-11’, 490,
/* 43 */ DATE ‘2010-02-12’, 499,
/* 44 */ DATE ‘2010-02-13’, 487,
/* 45 */ DATE ‘2010-02-14’, 513,
/* 46 */ DATE ‘2010-02-15’, 517,
/* 47 */ DATE ‘2010-02-16’, 498,
/* 48 */ DATE ‘2010-02-17’, 494,
/* 49 */ DATE ‘2010-02-18’, 484,
/* 50 */ DATE ‘2010-02-19’, 500,
/* 51 */ DATE ‘2010-02-20’, 488,
/* 52 */ DATE ‘2010-02-21’, 483,
/* 53 */ DATE ‘2010-02-22’, 484,
/* 54 */ DATE ‘2010-02-23’, 497,
/* 55 */ DATE ‘2010-02-24’, 487,
/* 56 */ DATE ‘2010-02-25’, 493,
/* 57 */ DATE ‘2010-02-26’, 515,
/* 58 */ DATE ‘2010-02-27’, 479,
/* 59 */ DATE ‘2010-02-28’, 505,
/* 60 */ DATE ‘2010-03-01’, 518,
/* 61 */ DATE ‘2010-03-02’, 491,
/* 62 */ DATE ‘2010-03-03’, 484,
/* 63 */ DATE ‘2010-03-04’, 510,
/* 64 */ DATE ‘2010-03-05’, 511,
/* 65 */ DATE ‘2010-03-06’, 494,
/* 66 */ DATE ‘2010-03-07’, 493,
/* 67 */ DATE ‘2010-03-08’, 500,
/* 68 */ DATE ‘2010-03-09’, 531,
/* 69 */ DATE ‘2010-03-10’, 497,
/* 70 */ DATE ‘2010-03-11’, 512,
/* 71 */ DATE ‘2010-03-12’, 481,
/* 72 */ DATE ‘2010-03-13’, 500,
/* 73 */ DATE ‘2010-03-14’, 510,
/* 74 */ DATE ‘2010-03-15’, 487,
/* 75 */ DATE ‘2010-03-16’, 511,
/* 76 */ DATE ‘2010-03-17’, 515,
/* 77 */ DATE ‘2010-03-18’, 513,
/* 78 */ DATE ‘2010-03-19’, 517,
/* 79 */ DATE ‘2010-03-20’, 516,
/* 80 */ DATE ‘2010-03-21’, 516,
/* 81 */ DATE ‘2010-03-22’, 504,
/* 82 */ DATE ‘2010-03-23’, 493,
/* 83 */ DATE ‘2010-03-24’, 510,
/* 84 */ DATE ‘2010-03-25’, 486,
/* 85 */ DATE ‘2010-03-26’, 495,
/* 86 */ DATE ‘2010-03-27’, 497,
/* 87 */ DATE ‘2010-03-28’, 481,
/* 88 */ DATE ‘2010-03-29’, 501,
/* 89 */ DATE ‘2010-03-30’, 486,
/* 90 */ DATE ‘2010-03-31’, 512,
/* 91 */ DATE ‘2010-04-01’, 487,
/* 92 */ DATE ‘2010-04-02’, 501,
/* 93 */ DATE ‘2010-04-03’, 490,
/* 94 */ DATE ‘2010-04-04’, 508,
/* 95 */ DATE ‘2010-04-05’, 518,
/* 96 */ DATE ‘2010-04-06’, 478,
/* 97 */ DATE ‘2010-04-07’, 494,
/* 98 */ DATE ‘2010-04-08’, 503,
/* 99 */ DATE ‘2010-04-09’, 494,
/* 100 */ DATE ‘2010-04-10’, 512,

Now we run EXPLAIN on the query below:

EXPLAIN SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-16';

1) First, we do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
“DWHPRO.Statistics_Estimation1.TheDate = DATE ‘2010-02-14′” with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 510 rows (15,810 bytes). The estimated time for
this step is 0.01 seconds.
2) Next, we do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
“DWHPRO.Statistics_Estimation1.TheDate = DATE ‘2010-02-15′” with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 1,020 rows (31,620 bytes). The estimated time
for this step is 0.01 seconds.
3) We do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
“DWHPRO.Statistics_Estimation1.TheDate = DATE ‘2010-02-16′” with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 1,530 rows (47,430 bytes). The estimated time
for this step is 0.01 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.02 seconds.

The Optimizer retrieves the requested primary index values via single-AMP primary index accesses, as the number of selected rows is small compared with the total number of table rows.

Did you notice that the estimation is 1,530 rows? Isn't that strange? If we sum up the three histogram values for our three dates (2010-02-14, 2010-02-15, and 2010-02-16), we can see that it should be 1,528 rows:

/* 45 */ DATE ‘2010-02-14’, 513,
/* 46 */ DATE ‘2010-02-15’, 517,
/* 47 */ DATE ‘2010-02-16’, 498,

513+517+498 = 1528

Furthermore, the Optimizer estimates 510 rows for each single-AMP access. Did we detect a bug in the Optimizer?

I would say no.

What happens is that an internal optimization takes place: it speeds up the estimation process at the cost of slightly inaccurate estimates. The Optimizer simply adds up the frequencies of all affected histogram values and divides the sum by their number. In our example:

1528 / 3 = 509.33

The ceiling of this number is used as the estimate for every single-AMP step, i.e. 510, 510, 510.
As a result, we end up with an estimate which is slightly wrong (3 x 510 = 1,530 instead of 1,528).
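The averaging shortcut described above can be sketched in a few lines of Python. This is only a simplified model of the Optimizer's behavior as observed in the explain plan, not actual Teradata code:

```python
import math

# Biased histogram frequencies for the three selected dates
# (2010-02-14, 2010-02-15, 2010-02-16)
frequencies = [513, 517, 498]

# Instead of using each exact frequency, the Optimizer sums them,
# divides by their number, and rounds up.
per_step = math.ceil(sum(frequencies) / len(frequencies))  # ceil(1528 / 3) = 510

# The same rounded value is applied to every single-AMP step, so the
# cumulative Spool 1 sizes grow in equal increments, as in the plan.
spool_sizes = [per_step * (i + 1) for i in range(len(frequencies))]

print(per_step)     # 510
print(spool_sizes)  # [510, 1020, 1530]
```

The final cumulative value, 1,530, is exactly the slightly-off estimate shown in step 3 of the explain plan, two rows more than the true sum of 1,528.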

We can easily prove this by increasing the number of retrieved rows, forcing the Optimizer to do a full-table scan instead of single-AMP lookups:

SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-28';

By extending the date range to 2010-02-28, the plan changes:

EXPLAIN SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-28';

1) First, we lock DWHPRO.Statistics_Estimation1 for read on a
reserved RowHash to prevent global deadlock.
2) Next, we lock DWHPRO.Statistics_Estimation1 for read.
3) We do an all-AMPs RETRIEVE step from DWHPRO.Statistics_Estimation1
by way of an all-rows scan with a condition of (
“(DWHPRO.Statistics_Estimation1.TheDate >= DATE ‘2010-02-14’) AND
(DWHPRO.Statistics_Estimation1.TheDate <= DATE ‘2010-02-28’)”)
into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with high confidence to be 7,438
rows (230,578 bytes). The estimated time for this step is 0.11
seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.11 seconds.

As you can verify by summing up the histogram values below, the estimate of 7,438 rows now matches the number of rows being retrieved almost exactly (the listed frequencies sum to 7,437); no averaging is applied in the all-AMP step.

/* 45 */ DATE ‘2010-02-14’, 513,
/* 46 */ DATE ‘2010-02-15’, 517,
/* 47 */ DATE ‘2010-02-16’, 498,
/* 48 */ DATE ‘2010-02-17’, 494,
/* 49 */ DATE ‘2010-02-18’, 484,
/* 50 */ DATE ‘2010-02-19’, 500,
/* 51 */ DATE ‘2010-02-20’, 488,
/* 52 */ DATE ‘2010-02-21’, 483,
/* 53 */ DATE ‘2010-02-22’, 484,
/* 54 */ DATE ‘2010-02-23’, 497,
/* 55 */ DATE ‘2010-02-24’, 487,
/* 56 */ DATE ‘2010-02-25’, 493,
/* 57 */ DATE ‘2010-02-26’, 515,
/* 58 */ DATE ‘2010-02-27’, 479,
/* 59 */ DATE ‘2010-02-28’, 505,
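As a quick sanity check, the listed frequencies can be summed up (plain Python arithmetic, not Teradata code). The total lands within a row of the 7,438 estimate, showing that the full-table-scan estimate is built from the individual histogram frequencies rather than from a rounded average:

```python
# Biased histogram frequencies for 2010-02-14 through 2010-02-28,
# copied from the SHOW STATISTICS VALUES output above
frequencies = [513, 517, 498, 494, 484, 500, 488, 483, 484, 497,
               487, 493, 515, 479, 505]

total = sum(frequencies)
print(total)  # 7437 -- within one row of the 7,438 estimate in the plan
```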

Whenever reading an execution plan, try to figure out how the estimates are built. Over time, you will get better and better at understanding estimates and will be able to solve statistics problems.

See also:
Teradata Statistics – A Case Study
Teradata Statistics Extrapolation for Equality Predicates


3 COMMENTS

  1. Thanks Roland for the wonderful article.

    It may sound silly, but I didn't get why taking the average of the histogram values is considered better by the Optimizer than using the values themselves.

    /* 45 */ DATE ‘2010-02-14’, 513,
    /* 46 */ DATE ‘2010-02-15’, 517,
    /* 47 */ DATE ‘2010-02-16’, 498,

    I mean, taking the exact histogram values should be better than summing them, dividing by the number of histograms, rounding up, and then multiplying by the number of histograms used.

    It would have made more sense if it had used the average rows per histogram value (in this case 50000/100 = 500) multiplied by the number of histogram values being retrieved (500 * 3 = 1500), which is clearly not the case.

    Cheers!!
