## Teradata Statistics Histograms – A Short Introduction

Most of us are familiar with the optimizer’s confidence levels for statistics. Today I was surprised by the fact that “high confidence” doesn’t mean automatically that we will have a correct or almost correct estimation (given that the collected statistics are not stale). Although I still hope my observations could be related to a bug, I would like to share the following information with you.

There are several ways the optimizer can use collected statistics for predicates used in WHERE conditions.

They can be utilized on their own or supported by advanced techniques such as random AMP-sampling (to detect staleness), extrapolation, etc.

The Optimizer can use the correct cardinality directly from the biased value histogram or pick up the cardinality information from one or several so-called equal height histograms. In this case, the estimated number of rows might deviate from the real cardinality if we are not retrieving the most frequent value of the histogram.

While the most frequent value in each equal height histogram is stored separately and together with its cardinality, the cardinality of other values in the related histogram is calculated in a simple way:

Number of Other Values (i.e. not the most frequent in the histogram) / Distinct Other Values = Average Rows per Value

Average Rows per Value is in a next step used as the base for the cardinality calculation. For example, let’s assume we have the following query:

SELECT * FROM TheTable WHERE COL1 = 5;

Our histogram carries the values 1,2,3,4,5 and we assume that values in (1,2,3,4) occurs 5 times but value=5 occurs 30 times. This means that we have:

-> 50 “Other values” (not being the most frequent of the histogram)

-> 5 distinct “other values”

The Average Rows per Value is 50/5 = 10.

In above example query, the optimizer would estimate the cardinality to be 10, but the real number is 30.

Any other WHERE condition is hitting the same histogram, such as SELECT * FROM TheTable WHERE COL1 = 1, would luckily deliver the correct cardinality (i.e. 10).

Nevertheless, we can expect that the number of estimated rows for statistics (not being stale and being used by the optimizer with “high confidence”) will be close to the real number of rows.

Until today, I thought this is the story. But then I was surprised when i tried to optimize the following query. Be aware that there was no history records in the histogram available, no other statistics then the one below collected:

COLLECT STATISTICS COLUMN (COL1) ON TheDatabase.TheTable;

SELECT * FROM TheDatabase.TheTable WHERE COL1= ‘ZZZ’;

The value ‘ZZZ’ was not available in my table, you can derive this fact as well from the below SHOW STATISTICS output:

SHOW STATISTICS VALUES COLUMN(COL1) ON TheDatabase.TheTable;

/** SummaryInfo **/

/* Version */ 6,

/* OriginalVersion */ 6,

/* DBSVersion */ ‘14.10.06.03’,

/* UsageType */ ‘D’,

/* ComplexStatInfo */ ‘ComplexStatInfo’,

/* NumOfBiasedValues */ 4,

/* NumOfEHIntervals */ 9,

/* NumOfHistoryRecords */ 0,

/* HighModeFreq */ 26412500,

/* NumOfDistinctVals */ 13,

/* NumOfRows */ 65057255,

/** Biased: Value, Frequency **/

/* 1 */ ‘Text0’, 253267,

/* 2 */ ‘Text99’, 26412500,

/* 3 */ ‘Text25’, 16767796,

/* 4 */ ‘Text10’, 21611177,

/** Interval: MaxVal, ModeVal, ModeFreq, LowFreq, OtherVals, OtherRows **/

/* 1 */ ‘Text1’, ‘Text1’, 55, 55, 0, 0,

/* 2 */ ‘Text2’, ‘Text2’, 9840, 9840, 0, 0,

/* 3 */ ‘Text3’, ‘Text3’, 2, 2, 0, 0,

/* 4 */ ‘Text4’, ‘Text4’, 1965, 1965, 0, 0,

/* 5 */ ‘Text5’, ‘Text5’, 1, 1, 0, 0,

/* 6 */ ‘Text6’, ‘Text6’, 10, 10, 0, 0,

/* 7 */ ‘Text7’, ‘Text7’, 4, 4, 0, 0,

/* 8 */ ‘Text8’, ‘Text8’, 3, 3, 0, 0,

/* 9 */ ‘Text9’, ‘Text9′, 635, 635, 0, 0

COL1=’ZZZ’ is not hitting a biased value (Text0,Text99,Text25,Text10) or any of the 9 equal height histograms. I would have expected that the cardinality estimation therefore would be 0 rows, but executing the EXPLAIN statement, i saw this:

Explain SELECT * FROM TheDatabase.TheTable WHERE COL1 = ‘ZZZ’

1) First, we lock a distinct TheDatabase.”pseudo table”

for read on a RowHash to prevent global deadlock for

TheDatabase.TheTable.

2) Next, we lock TheDatabase.TheTable for read.

3) We do an all-AMPs RETRIEVE step from

TheDatabase.TheTable by way of an all-rows

scan with a condition of (

“TheDatabase.TheTable.COL1 = ‘XYZ'”) into

Spool 1 (group_amps), which is built locally on the AMPs. The

input table will not be cached in memory, but it is eligible for

synchronized scanning. The size of Spool 1 is estimated with high

confidence to be **5,004,405** rows (3,337,938,135 bytes). The

estimated time for this step is 37.22 seconds.

4) Finally, we send out an END TRANSACTION step to all AMPs involved

in processing the request.

-> The contents of Spool 1 are sent back to the user as the result of

statement 1. The total estimated time is 37.22 seconds.

**The estimation was about 5 Million rows with high confidence!**

This looks very strange to me, as the Optimizer should know (in my opinion) that none of the histograms has the selected value, the estimated cardinality should be 0.

But was the optimizer does, in this case, is the following calculation:

Number of Table Rows (65,057,255) / Number of Distinct Values (13) = 5,004,405 (The Number of table rows and distinct values you can see in above SHOW STATISTICS output).

The rule is: We have on average 5,004,405 rows per value. As I don’t have the WHERE condition value = ‘ZZZ’ in my histograms, I will use the average value for my estimations. As you can see in my example, this is leading to a completely wrong estimation, even with “high confidence”.

But what worries me even more: The estimation error is growing together with the table size!

Anyway, there is a good message as well: If you select a column value which is available together with a value from the same column which does not exist, the total estimated cardinality is limited to the number of table rows.

As I mentioned at the begin of this article, I hope this is a bug. If anyone has a more detailed insight, please let us know by leaving a comment. Thanks!

The same issue is discussed here: http://forums.teradata.com/t5/Database/Wrong-Statistic-Estimations-of-Optimizer-Teradata-14-10/m-p/56172

While, at first sight, I’d agree with you that this is some kind of a bug, dnoeth interpretation makes sense.

Consider it this way: in an application you can use an “in memory” cache to return data on some key. What if the user asks for a key that is not present in the cache? A common solution is to go back to the real (slow) data and check, because we’d rather take the (hopefully rare) hit, rather than not returning existing data to the user.

In this scenario, statistics could be considered as some sort of “cache”. When the system receives a value in input that is not in the cache, this makes the system doubt the freshness of the data, and go back to heuristics.