Teradata Statistics Histograms – A Short Introduction
Many are familiar with Optimizer’s statistical confidence levels. I was recently surprised to discover that a “high confidence” rating does not guarantee a fully accurate estimation (provided the statistics collected are not stale). While I remain hopeful that my observations may be attributed to a bug, I wanted to share this information with you.
The Optimizer can utilize gathered statistics for predicates used in WHERE clauses in various ways.
They can be used alone or in conjunction with advanced methods like random AMP sampling (for detecting staleness) and extrapolation.
The Optimizer can obtain accurate cardinality from the biased value histogram or multiple equal height histograms. However, if the histogram’s most frequent value is not retrieved, the estimated row count may differ from the actual cardinality.
The most common value in every histogram with equal heights is stored along with its frequency, while other values in the histogram are computed.
Number of Other Values (i.e., not the most frequent in the histogram) / Distinct Other Values = Average Rows per Value
The next step involves determining the Average Rows per Value, the foundation for calculating cardinality. For instance, consider the following query:
SELECT * FROM TheTable WHERE COL1 = 5;
The histogram displays values 1 through 5. Values 1 through 4 have a frequency of 5, while the value 5 has a frequency of 30.
-> 50 “Other values” (not being the most frequent of the histogram)
-> 5 distinct “other values”
The Average Rows per Value is 50/5 = 10.
In the example query, the Optimizer estimated the cardinality as 10 when it was 30.
Using any other WHERE condition that matches the same histogram, for example, SELECT * FROM TheTable WHERE COL1 = 1, will, fortunately, yield the accurate cardinality of 10.
However, we can anticipate that the estimated row count for statistics, which are not stale and relied upon by the Optimizer with a “high level of confidence,” will closely approximate the actual number of rows.
Previously, I believed this to be the narrative. However, I was astounded when I sought to enhance the subsequent inquiry. Note that there were no archival data in the histogram at my disposal, and solely the statistics provided below were gathered:
COLLECT STATISTICS COLUMN (COL1) ON TheDatabase.TheTable;
SELECT * FROM TheDatabase.TheTable WHERE COL1= 'ZZZ';
‘ZZZ’ was not found in the table. This information is also evident in the following SHOW STATISTICS output:
SHOW STATISTICS VALUES COLUMN(COL1) ON TheDatabase.TheTable;
/** SummaryInfo **/
/* Version */ 6,
/* OriginalVersion */ 6,
/* DBSVersion */ '14.10.06.03',
/* UsageType */ 'D',
/* ComplexStatInfo */ 'ComplexStatInfo',
/* NumOfBiasedValues */ 4,
/* NumOfEHIntervals */ 9,
/* NumOfHistoryRecords */ 0,
/* HighModeFreq */ 26412500,
/* NumOfDistinctVals */ 13,
/* NumOfRows */ 65057255,
/** Biased: Value, Frequency **/
/* 1 */ 'Text0', 253267,
/* 2 */ 'Text99', 26412500,
/* 3 */ 'Text25', 16767796,
/* 4 */ 'Text10', 21611177,
/** Interval: MaxVal, ModeVal, ModeFreq, LowFreq, OtherVals, OtherRows **/
/* 1 */ 'Text1', 'Text1', 55, 55, 0, 0,
/* 2 */ 'Text2', 'Text2', 9840, 9840, 0, 0,
/* 3 */ 'Text3', 'Text3', 2, 2, 0, 0,
/* 4 */ 'Text4', 'Text4', 1965, 1965, 0, 0,
/* 5 */ 'Text5', 'Text5', 1, 1, 0, 0,
/* 6 */ 'Text6', 'Text6', 10, 10, 0, 0,
/* 7 */ 'Text7', 'Text7', 4, 4, 0, 0,
/* 8 */ 'Text8', 'Text8', 3, 3, 0, 0,
/* 9 */ 'Text9', 'Text9', 635, 635, 0, 0
COL1=’ZZZ’ is not hitting a biased value (Text0, Text99, Text25, Text10) or any of the 9 equal height histograms. I would have expected that the cardinality estimation, therefore, would be 0 rows, but executing the EXPLAIN statement, I saw this:
Explain SELECT * FROM TheDatabase.TheTable WHERE COL1 = ‘ZZZ’
1) First, we lock a distinct TheDatabase."pseudo table"
for read on a RowHash to prevent global deadlock for
TheDatabase.TheTable.
2) Next, we lock TheDatabase.TheTable for read.
3) We do an all-AMPs RETRIEVE step from
TheDatabase.TheTable by way of an all-rows
scan with a condition of (
"TheDatabase.TheTable.COL1 = 'XYZ'") into
Spool 1 (group_amps), which is built locally on the AMPs. The
input table will not be cached in memory but is eligible for
synchronized scanning. The size of Spool 1 is estimated with high
confidence to be 5,004,405 rows (3,337,938,135 bytes). The
estimated time for this step is 37.22 seconds.
4) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> The contents of Spool 1 are sent back to the user as the result of
statement 1. The total estimated time is 37.22 seconds.
The estimation was about 5 Million rows with high confidence!
This appears peculiar to me as the Optimizer should know that none of the histograms contain the designated value. The projected cardinality must be zero.
The Optimizer performs the following calculation:
The number of Table Rows (65,057,255) / Number of Distinct Values (13) = 5,004,405 (The Number of table rows and distinct values you can see in the above SHOW STATISTICS output).
The rule is: We have, on average, 5,004,405 rows per value. As I don’t have the WHERE condition value = ‘ZZZ’ in my histograms, I will use the average value for my estimations. In my example, this leads to an entirely wrong estimate, even with “high confidence”.
I am even more concerned that the margin of error increases proportionally to the size of the table.
Selecting an available column value and a non-existent value from the same column will limit the total estimated cardinality to the number of table rows. This is a positive message to take away.
I hope this issue is a bug. If you have further information, kindly share it by commenting. Thank you!
While, at first sight, I’d agree with you that this is some kind of a bug, dnoeth interpretation makes sense.
Consider it this way: in an application, you can use an “in memory” cache to return data on some key. What if the user asks for a key that is not present in the cache? A common solution is to go back to the real (slow) data and check, because we’d rather take the (hopefully rare) hit, rather than not return existing data to the user.
In this scenario, statistics could be considered as some sort of “cache”. When the system receives a value in input that is not in the cache, this makes the system doubt the freshness of the data, and go back to heuristics.