## Teradata Statistics Histograms – A Short Introduction

Most of us are familiar with Optimizer’s confidence levels for statistics. Today I was surprised that “high confidence” doesn’t mean we will have a correct or almost correct estimation (given that the collected statistics are not stale). Although I still hope my observationsÂ could be related to a bug, IÂ would like to share the following information with you.

There are several ways the Optimizer can use collected statistics for predicates used in WHERE conditions.

They can be utilized independently or supported by advanced techniques such as random AMP-sampling (to detect staleness), extrapolation, etc.

The Optimizer can use the correct cardinality directly from the biased value histogram or pick up the cardinality information from one or several so-called equal height histograms. In this case, the estimated number of rows might deviate from the actual cardinality if we do not retrieve the histogram’s most frequent value.

While the most frequent value in each equal height histogram is stored separately and together with its cardinality, the cardinality of other values in the related histogram is calculated simply:

Number of Other Values (i.e., not the most frequent in the histogram) / Distinct Other Values = Average Rows per Value

Average Rows per Value are in the next step and used as the base for the cardinality calculation. For example, let’s assume we have the following query:

SELECT * FROM TheTable WHERE COL1 = 5;

Our histogram carries the values 1,2,3,4,5 and we assume that values in (1,2,3,4) occurs 5 times but value=5 occurs 30 times. This means that we have:

-> 50 “Other values” (not being the most frequent of the histogram)

-> 5 distinct “other values”

The Average Rows per Value is 50/5 = 10.

In the above example query, the Optimizer would estimate the cardinality to be 10, but the actual number is 30.

Any other WHERE condition is hitting the same histogram, such as SELECT * FROM TheTable WHERE COL1 = 1, would luckily deliver the correct cardinality (i.e., 10).

Nevertheless, we can expect that the number of estimated rows for statistics (not being stale and being used by the Optimizer with “high confidence”) will be close to the actual number of rows.

Until today, I thought this was the story. But then I was surprised when I tried to optimize the following query. Be aware that there were no historical records in the histogram available, no other statistics than the one below collected:

COLLECT STATISTICS COLUMN (COL1) ON TheDatabase.TheTable;

SELECT * FROM TheDatabase.TheTable WHERE COL1= ‘ZZZ’;

The value ‘ZZZ’ was not available in my table, you can derive this fact as well from the below SHOW STATISTICS output:

SHOW STATISTICS VALUES COLUMN(COL1) ON TheDatabase.TheTable;

/** SummaryInfo **/

/* Version */ 6,

/* OriginalVersion */ 6,

/* DBSVersion */ ‘14.10.06.03’,

/* UsageType */ ‘D’,

/* ComplexStatInfo */ ‘ComplexStatInfo’,

/* NumOfBiasedValues */ 4,

/* NumOfEHIntervals */ 9,

/* NumOfHistoryRecords */ 0,

/* HighModeFreq */ 26412500,

/* NumOfDistinctVals */ 13,

/* NumOfRows */ 65057255,

/** Biased: Value, Frequency **/

/* 1 */ ‘Text0’, 253267,

/* 2 */ ‘Text99’, 26412500,

/* 3 */ ‘Text25’, 16767796,

/* 4 */ ‘Text10’, 21611177,

/** Interval: MaxVal, ModeVal, ModeFreq, LowFreq, OtherVals, OtherRows **/

/* 1 */ ‘Text1’, ‘Text1’, 55, 55, 0, 0,

/* 2 */ ‘Text2’, ‘Text2’, 9840, 9840, 0, 0,

/* 3 */ ‘Text3’, ‘Text3’, 2, 2, 0, 0,

/* 4 */ ‘Text4’, ‘Text4’, 1965, 1965, 0, 0,

/* 5 */ ‘Text5’, ‘Text5’, 1, 1, 0, 0,

/* 6 */ ‘Text6’, ‘Text6’, 10, 10, 0, 0,

/* 7 */ ‘Text7’, ‘Text7’, 4, 4, 0, 0,

/* 8 */ ‘Text8’, ‘Text8’, 3, 3, 0, 0,

/* 9 */ ‘Text9’, ‘Text9’, 635, 635, 0, 0

COL1=’ZZZ’ is not hitting a biased value (Text0,Text99,Text25,Text10) or any of the 9 equal height histograms. I would have expected that the cardinality estimation, therefore would be 0 rows, but executing the EXPLAIN statement, I saw this:

Explain SELECT * FROM TheDatabase.TheTable WHERE COL1 = ‘ZZZ’

1) First, we lock a distinct TheDatabase.”pseudo table”

for read on a RowHash to prevent global deadlock for

TheDatabase.TheTable.

2) Next, we lock TheDatabase.TheTable for read.

3) We do an all-AMPs RETRIEVE step from

TheDatabase.TheTable by way of an all-rows

scan with a condition of (

“TheDatabase.TheTable.COL1 = ‘XYZ'”) into

Spool 1 (group_amps), which is built locally on the AMPs. The

input table will not be cached in memory, but it is eligible for

synchronized scanning. The size of Spool 1 is estimated with high

confidence to be **5,004,405** rows (3,337,938,135 bytes). The

estimated time for this step is 37.22 seconds.

4) Finally, we send out an END TRANSACTION step to all AMPs involved

in processing the request.

-> The contents of Spool 1 are sent back to the user as the result of

statement 1. The total estimated time is 37.22 seconds.

**The estimation was about 5 Million rows with high confidence!**

This looks very strange to me, as the Optimizer should know (in my opinion) that none of the histograms has the selected value. The estimated cardinality should be 0.

But was the Optimizer does, in this case, is the following calculation:

Number of Table Rows (65,057,255) / Number of Distinct Values (13) = 5,004,405 (The Number of table rows and distinct values you can see in above SHOW STATISTICS output).

The rule is: We have, on average, 5,004,405 rows per value. As I don’t have the WHERE condition value = ‘ZZZ’ in my histograms, I will use the average value for my estimations. In my example, this leads to an entirely wrong estimate, even with “high confidence”.

But what worries me even more, is that the estimation error is growing together with the table size!

Anyway, there is a good message: If you select a column value that is available and a value from the same column that does not exist, the total estimated cardinality is limited to the number of table rows.

At the beginning of this article, I hope this is a bug. If anyone has more detailed insight, please let us know by commenting. Thanks!

While, at first sight, I’d agree with you that this is some kind of a bug, dnoeth interpretation makes sense.

Consider it this way: in an application, you can use an “in memory” cache to return data on some key. What if the user asks for a key that is not present in the cache? A common solution is to go back to the real (slow) data and check, because we’d rather take the (hopefully rare) hit, rather than not return existing data to the user.

In this scenario, statistics could be considered as some sort of “cache”. When the system receives a value in input that is not in the cache, this makes the system doubt the freshness of the data, and go back to heuristics.