Teradata Statistics: Estimates For Missing Values

Roland Wenzlofsky

December 21, 2016

Biased value histogram entries store exact cardinalities. With up-to-date statistics, the estimated number of rows in a retrieve step on a biased value should therefore match the number of rows actually returned.

In other words: the Optimizer can quickly determine the number of rows being retrieved by looking up the biased value in the histogram.

Unfortunately, this is not always the case.

I created a test scenario, populating a test table with 50,000 rows spread across 100 distinct dates:

CREATE MULTISET TABLE Statistics_Estimation1
(
TheDate DATE FORMAT 'YYYY-MM-DD' NOT NULL
) PRIMARY INDEX (TheDate);

INSERT INTO Statistics_Estimation1
SELECT DATE '2010-01-01' + (ROW_NUMBER() OVER (ORDER BY 1) MOD 100) AS TheDate
FROM SYS_CALENDAR.CALENDAR
SAMPLE 50000;
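
To verify the distribution, here is a quick sanity check (my addition, not part of the original test script); it should return 100 distinct dates with roughly 500 rows each:

SELECT TheDate, COUNT(*) AS RowsPerDate
FROM Statistics_Estimation1
GROUP BY 1
ORDER BY 1;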

To start with fresh statistics, I executed the statement below. Because there are only 100 distinct values, the system can pack all 50,000 rows into 100 biased value entries of the histogram.

This is critical for our test setup, as it ensures that the estimated number of rows should always match 100% of the retrieved number of rows:

COLLECT STATS ON Statistics_Estimation1 COLUMN(TheDate);
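
If you only need a quick summary first, HELP STATISTICS shows when statistics were collected and the number of unique values found (100 in our setup):

HELP STATISTICS Statistics_Estimation1;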

We can now inspect the histogram by issuing the statement below:

SHOW STATISTICS VALUES ON Statistics_Estimation1;

/** Biased: Value, Frequency **/
/* 1 */ DATE '2010-01-01', 515,
/* 2 */ DATE '2010-01-02', 498,
/* 3 */ DATE '2010-01-03', 510,
/* 4 */ DATE '2010-01-04', 482,
/* 5 */ DATE '2010-01-05', 490,
/* 6 */ DATE '2010-01-06', 518,
/* 7 */ DATE '2010-01-07', 510,
/* 8 */ DATE '2010-01-08', 512,
/* 9 */ DATE '2010-01-09', 493,
/* 10 */ DATE '2010-01-10', 513,
/* 11 */ DATE '2010-01-11', 498,
/* 12 */ DATE '2010-01-12', 513,
/* 13 */ DATE '2010-01-13', 509,
/* 14 */ DATE '2010-01-14', 500,
/* 15 */ DATE '2010-01-15', 491,
/* 16 */ DATE '2010-01-16', 489,
/* 17 */ DATE '2010-01-17', 481,
/* 18 */ DATE '2010-01-18', 508,
/* 19 */ DATE '2010-01-19', 496,
/* 20 */ DATE '2010-01-20', 505,
/* 21 */ DATE '2010-01-21', 498,
/* 22 */ DATE '2010-01-22', 496,
/* 23 */ DATE '2010-01-23', 505,
/* 24 */ DATE '2010-01-24', 504,
/* 25 */ DATE '2010-01-25', 485,
/* 26 */ DATE '2010-01-26', 508,
/* 27 */ DATE '2010-01-27', 494,
/* 28 */ DATE '2010-01-28', 507,
/* 29 */ DATE '2010-01-29', 495,
/* 30 */ DATE '2010-01-30', 501,
/* 31 */ DATE '2010-01-31', 486,
/* 32 */ DATE '2010-02-01', 517,
/* 33 */ DATE '2010-02-02', 501,
/* 34 */ DATE '2010-02-03', 505,
/* 35 */ DATE '2010-02-04', 496,
/* 36 */ DATE '2010-02-05', 514,
/* 37 */ DATE '2010-02-06', 492,
/* 38 */ DATE '2010-02-07', 496,
/* 39 */ DATE '2010-02-08', 504,
/* 40 */ DATE '2010-02-09', 486,
/* 41 */ DATE '2010-02-10', 509,
/* 42 */ DATE '2010-02-11', 490,
/* 43 */ DATE '2010-02-12', 499,
/* 44 */ DATE '2010-02-13', 487,
/* 45 */ DATE '2010-02-14', 513,
/* 46 */ DATE '2010-02-15', 517,
/* 47 */ DATE '2010-02-16', 498,
/* 48 */ DATE '2010-02-17', 494,
/* 49 */ DATE '2010-02-18', 484,
/* 50 */ DATE '2010-02-19', 500,
/* 51 */ DATE '2010-02-20', 488,
/* 52 */ DATE '2010-02-21', 483,
/* 53 */ DATE '2010-02-22', 484,
/* 54 */ DATE '2010-02-23', 497,
/* 55 */ DATE '2010-02-24', 487,
/* 56 */ DATE '2010-02-25', 493,
/* 57 */ DATE '2010-02-26', 515,
/* 58 */ DATE '2010-02-27', 479,
/* 59 */ DATE '2010-02-28', 505,
/* 60 */ DATE '2010-03-01', 518,
/* 61 */ DATE '2010-03-02', 491,
/* 62 */ DATE '2010-03-03', 484,
/* 63 */ DATE '2010-03-04', 510,
/* 64 */ DATE '2010-03-05', 511,
/* 65 */ DATE '2010-03-06', 494,
/* 66 */ DATE '2010-03-07', 493,
/* 67 */ DATE '2010-03-08', 500,
/* 68 */ DATE '2010-03-09', 531,
/* 69 */ DATE '2010-03-10', 497,
/* 70 */ DATE '2010-03-11', 512,
/* 71 */ DATE '2010-03-12', 481,
/* 72 */ DATE '2010-03-13', 500,
/* 73 */ DATE '2010-03-14', 510,
/* 74 */ DATE '2010-03-15', 487,
/* 75 */ DATE '2010-03-16', 511,
/* 76 */ DATE '2010-03-17', 515,
/* 77 */ DATE '2010-03-18', 513,
/* 78 */ DATE '2010-03-19', 517,
/* 79 */ DATE '2010-03-20', 516,
/* 80 */ DATE '2010-03-21', 516,
/* 81 */ DATE '2010-03-22', 504,
/* 82 */ DATE '2010-03-23', 493,
/* 83 */ DATE '2010-03-24', 510,
/* 84 */ DATE '2010-03-25', 486,
/* 85 */ DATE '2010-03-26', 495,
/* 86 */ DATE '2010-03-27', 497,
/* 87 */ DATE '2010-03-28', 481,
/* 88 */ DATE '2010-03-29', 501,
/* 89 */ DATE '2010-03-30', 486,
/* 90 */ DATE '2010-03-31', 512,
/* 91 */ DATE '2010-04-01', 487,
/* 92 */ DATE '2010-04-02', 501,
/* 93 */ DATE '2010-04-03', 490,
/* 94 */ DATE '2010-04-04', 508,
/* 95 */ DATE '2010-04-05', 518,
/* 96 */ DATE '2010-04-06', 478,
/* 97 */ DATE '2010-04-07', 494,
/* 98 */ DATE '2010-04-08', 503,
/* 99 */ DATE '2010-04-09', 494,
/* 100 */ DATE '2010-04-10', 512,

Now we request the execution plan for the query below:

EXPLAIN SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-16';

1) First, we do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
"DWHPRO.Statistics_Estimation1.TheDate = DATE '2010-02-14'" with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 510 rows (15,810 bytes). The estimated time for
this step is 0.01 seconds.
2) Next, we do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
"DWHPRO.Statistics_Estimation1.TheDate = DATE '2010-02-15'" with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 1,020 rows (31,620 bytes). The estimated time
for this step is 0.01 seconds.
3) We do a single-AMP RETRIEVE step from
DWHPRO.Statistics_Estimation1 by way of the primary index
"DWHPRO.Statistics_Estimation1.TheDate = DATE '2010-02-16'" with
no residual conditions into Spool 1 (group_amps), which is built
locally on that AMP. The size of Spool 1 is estimated with high
confidence to be 1,530 rows (47,430 bytes). The estimated time
for this step is 0.01 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.02 seconds.

The Optimizer retrieves the requested primary index values via single-AMP primary index accesses, as the number of selected rows is small compared with the total number of table rows.

Did you notice that the estimate is 1,530 rows? Isn’t that strange? If we sum the three histogram values for our three dates (2010-02-14, 2010-02-15, and 2010-02-16), we can see that it should be 1,528 rows:

/* 45 */ DATE '2010-02-14', 513,
/* 46 */ DATE '2010-02-15', 517,
/* 47 */ DATE '2010-02-16', 498,

513 + 517 + 498 = 1,528
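
Because biased values store exact cardinalities, we can cross-check the true row count directly; the following query should return 1,528:

SELECT COUNT(*) AS RetrievedRows
FROM Statistics_Estimation1
WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-16';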

Furthermore, the Optimizer estimates 510 rows for each single-AMP access. Did we discover a bug in the Optimizer?

I would say no.

What happens is an internal optimization that can produce a slightly inaccurate estimate but speeds up the estimation process: the Optimizer sums the biased values covered by the predicate and divides the result by the number of values. In our example:

1528 / 3 = 509.33

The ceiling of this number is used to distribute the estimate evenly across the single-AMP retrieve steps, i.e., 510, 510, and 510. Since Spool 1 grows cumulatively, the plan shows 510, 1,020, and 1,530 rows. As a result, we end up with a slightly inaccurate estimate of 1,530 instead of 1,528 rows.
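
Here is a minimal arithmetic sketch of this rounding, assuming the averaging behavior described above (it reproduces the observed numbers but is not official Optimizer internals):

-- Sum the biased values in the range, average them, round up,
-- and distribute the result evenly across the retrieve steps.
SELECT
CAST(CEILING((513 + 517 + 498) / 3.000) AS INTEGER) AS EstimatePerStep, -- 510
CAST(CEILING((513 + 517 + 498) / 3.000) AS INTEGER) * 3 AS TotalEstimate; -- 1530 instead of 1528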

We can easily prove this by increasing the number of retrieved rows, forcing the Optimizer to do a full-table scan instead of single-AMP lookups. Extending the date range to 2010-02-28 changes the plan:

EXPLAIN SELECT TheDate FROM Statistics_Estimation1 WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-28';

1) First, we lock DWHPRO.Statistics_Estimation1 for read on a
reserved RowHash to prevent global deadlock.
2) Next, we lock DWHPRO.Statistics_Estimation1 for read.
3) We do an all-AMPs RETRIEVE step from DWHPRO.Statistics_Estimation1
by way of an all-rows scan with a condition of (
"(DWHPRO.Statistics_Estimation1.TheDate >= DATE '2010-02-14') AND
(DWHPRO.Statistics_Estimation1.TheDate <= DATE '2010-02-28')")
into Spool 1 (group_amps), which is built locally on the AMPs.
The size of Spool 1 is estimated with high confidence to be 7,438
rows (230,578 bytes). The estimated time for this step is 0.11
seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 0.11 seconds.

As you can verify by summing the histogram values below (they add up to 7,437), the full-table-scan estimate of 7,438 now matches the number of retrieved rows almost exactly; only a single row of rounding remains.

/* 45 */ DATE '2010-02-14', 513,
/* 46 */ DATE '2010-02-15', 517,
/* 47 */ DATE '2010-02-16', 498,
/* 48 */ DATE '2010-02-17', 494,
/* 49 */ DATE '2010-02-18', 484,
/* 50 */ DATE '2010-02-19', 500,
/* 51 */ DATE '2010-02-20', 488,
/* 52 */ DATE '2010-02-21', 483,
/* 53 */ DATE '2010-02-22', 484,
/* 54 */ DATE '2010-02-23', 497,
/* 55 */ DATE '2010-02-24', 487,
/* 56 */ DATE '2010-02-25', 493,
/* 57 */ DATE '2010-02-26', 515,
/* 58 */ DATE '2010-02-27', 479,
/* 59 */ DATE '2010-02-28', 505,
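
Again, the true cardinality can be cross-checked directly; since the biased values add up to 7,437, the following query should return 7,437 rows:

SELECT COUNT(*) AS RetrievedRows
FROM Statistics_Estimation1
WHERE TheDate BETWEEN DATE '2010-02-14' AND DATE '2010-02-28';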

Whenever you read an execution plan, try to figure out how the estimates were derived. Over time, you will get better at interpreting estimates and at solving statistics-related problems.

See also:
Teradata Statistics – A Case Study
Teradata Statistics Extrapolation for Equality Predicates

  • rohit Khattri says:

    Thanks, Roland for a wonderful article.

    It may sound silly, but I didn’t get why taking the average of the histogram values is considered better by the Optimizer than using the values themselves.

    /* 45 */ DATE '2010-02-14', 513,
    /* 46 */ DATE '2010-02-15', 517,
    /* 47 */ DATE '2010-02-16', 498,

    I mean, taking the exact values of the histograms should be better than summing the values, dividing by the number of histograms, rounding up, and then multiplying by the number of histograms used.

    It would have made more sense if it had used the average rows per histogram value (in this case 50000/100 = 500) multiplied by the number of histogram values being retrieved (500 * 3 = 1500), which is clearly not the case.

    Cheers !!

  • Soumen Chatterjee says:

    “By extending the date range to 2016-02-28, the plan changes”
    the Date must be 2010-02-28 and not 2016-02-28.
