Missing statistics on indexed columns is usually not a big deal. The Optimizer substitutes them with random-AMP sampled statistics, and as long as the tables are not heavily skewed, the sampled estimations are sufficient to create a good execution plan.
But if you omit statistics on non-indexed WHERE condition columns, the Optimizer has to fall back on crude heuristics, and workload performance will suffer.
This article discusses the Optimizer’s weak approach to estimating cardinality for non-indexed columns.
Teradata Heuristics – A Nasty Surprise
If there are no collected statistics on the non-indexed WHERE condition columns, the Optimizer uses an elementary method to estimate the number of rows:
If the WHERE conditions are combined with AND, the following formula is applied:
SELECT * FROM The_Table
WHERE
first_column = 'A'         -- 10% of the table cardinality
AND second_column = 'B'    -- 75% of the above estimation
AND third_column = 'C'     -- 75% of the previous estimation
...
AND nth_column = 'X';      -- 75% of the estimation for column n-1

The estimation is: (estimated number of rows in the table) * 0.1 * (0.75 * 0.75 * ... * 0.75) rows
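To make the formula tangible, here is a minimal worked example. The cardinality of 1,000,000 rows is an assumption for illustration only; the percentages are the heuristics described above:

-- Hypothetical: The_Table is estimated at 1,000,000 rows, three AND conditions
-- first_column  -> 10% of 1,000,000 = 100,000
-- second_column -> 75% of   100,000 =  75,000
-- third_column  -> 75% of    75,000 =  56,250
SELECT 1000000 * 0.10 * 0.75 * 0.75 AS estimated_rows;   -- 56,250

No matter what the real data demographics look like, the Optimizer will assume 56,250 rows for this retrieve step.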
If the WHERE conditions are combined with OR, this formula is applied:
SELECT * FROM The_Table
WHERE
first_column = 'A'        -- 10% of the table cardinality
OR second_column = 'B'    -- 10% of the table cardinality
OR third_column = 'C'     -- 10% of the table cardinality
...
OR nth_column = 'X';      -- 10% of the table cardinality

The estimation is: (estimated number of rows in the table) * 0.1 * n
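The same minimal worked example for the OR case, again assuming a hypothetical cardinality of 1,000,000 rows:

-- Hypothetical: The_Table is estimated at 1,000,000 rows, three OR conditions
-- each OR condition contributes 10% of the table cardinality
SELECT 1000000 * 0.1 * 3 AS estimated_rows;   -- 300,000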
Remark: The table cardinality (the number of table rows) is always taken from the collected statistics or the random-AMP sampled statistics on the primary index.
This is the nasty surprise when dealing with heuristics: the percentages are fixed and have nothing to do with the real data demographics.
A short remark about the estimation of columns combined with OR: the rule described above applies as of Teradata 14.10. Earlier releases decrease the above estimate by 1% for each intersection (in the set-theory sense), such as:
A OR B: 1 intersection, the estimation is decreased by 1%
A OR B OR C: 3 intersections, the estimation is decreased by 3%
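A minimal sketch of the pre-14.10 adjustment, assuming (as one interpretation of the rule above) that the percentage is deducted from the combined OR estimate, and again using the hypothetical 1,000,000-row table with three OR conditions:

-- Hypothetical pre-14.10 estimate: 3 intersections -> combined estimate reduced by 3%
SELECT 1000000 * 0.1 * 3 * 0.97 AS estimated_rows;   -- 291,000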
Altogether: Heuristics are a bad way to estimate the cardinality of retrieve steps.
The Optimizer’s Toolkit to Improve Estimations
Luckily, the Optimizer can make use of additional information to improve estimations for non-indexed WHERE condition columns:
If statistics are collected on at least one column accessed in the WHERE condition, this column’s estimation is used as the starting point instead of the inaccurate 10% rule. For the remaining columns, the heuristic approach is still applied:
COLLECT STATISTICS ON The_Table COLUMN first_column;

We assume that the statistics on column first_column show 5,000 rows containing the value 'A'.

SELECT * FROM The_Table
WHERE
first_column = 'A'         -- 5,000 rows (from the collected statistics)
AND second_column = 'B';   -- 75% of the above estimation

The estimation is: 5,000 * 0.75 = 3,750 rows
The above example shows how the Optimizer uses available statistics to improve the “starting point” for further estimations.
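If you want to verify such estimations on your own system, the EXPLAIN modifier shows the estimated number of rows and the confidence level for each step. A minimal check, reusing the hypothetical table and column names from the example (the exact plan wording varies by release):

EXPLAIN
SELECT * FROM The_Table
WHERE
first_column = 'A'
AND second_column = 'B';

-- The retrieve step of the plan text contains the estimation, for example:
-- "The size of Spool 1 is estimated with no confidence to be 3,750 rows."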
But what happens if we collect statistics on both columns (first_column and second_column)?
In this case, the Optimizer will choose as a starting point the statistics of the column with the highest selectivity and apply the usual heuristics to the second column, without considering the second column’s collected statistics.
COLLECT STATISTICS ON The_Table COLUMN first_column;    -- 5,000 rows with value 'A'
COLLECT STATISTICS ON The_Table COLUMN second_column;   -- 200 rows with value 'B'

SELECT * FROM The_Table
WHERE
first_column = 'A'         -- 75% of the estimation below
AND second_column = 'B';   -- 200 rows (highest selectivity, used as the starting point)

The estimation is: 200 * 0.75 = 150 rows
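To check which statistics the Optimizer actually has at its disposal for a table, the HELP STATISTICS command lists the collected statistics, including the collection date and time, the number of unique values, and the column names. Table name as in the example above:

HELP STATISTICS The_Table;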
It becomes even better:
The Twisted World of the Optimizer: How Skew Improves the Estimations
The Optimizer uses the skew information it has about column first_column to make the estimation even more accurate:
Here is the “skew rule” the Optimizer applies:
Suppose the value selected from column first_column occurs in more than 75% of the rows and is therefore stored as a biased value in the statistics histogram. In that case, the 75% heuristic is replaced by a better estimation, namely the real frequency of the skewed value.
Here is our example from above again. Additionally, we assume that 90% of all rows contain the value 'A' in column first_column. The column is skewed, and the histogram’s biased values can be used:
COLLECT STATISTICS ON The_Table COLUMN first_column;    -- 5,000 rows with value 'A'
COLLECT STATISTICS ON The_Table COLUMN second_column;   -- 200 rows with value 'B'

SELECT * FROM The_Table
WHERE
first_column = 'A'         -- 90% of the estimation below (biased value)
AND second_column = 'B';   -- 200 rows (highest selectivity, used as the starting point)

The estimation is: 200 * 0.90 = 180 rows
By using the information about biased values, the estimations become better.
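On recent releases, you can inspect the collected histogram, including its biased values, directly. A minimal sketch, assuming the table and column names from the example:

SHOW STATISTICS VALUES COLUMN first_column ON The_Table;

The output contains, among other details, the list of biased values and their frequencies, which is exactly the information the skew rule relies on.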
You may be wondering how even better estimations can be achieved, given that the Optimizer will at best use the collected statistics of exactly one WHERE condition column.
The solution is multi-column statistics.
COLLECT STATISTICS ON The_Table COLUMN (first_column, second_column);

SELECT * FROM The_Table
WHERE
first_column = 'A'
AND second_column = 'B';
By adding multi-column statistics, the Optimizer no longer depends on heuristics for this WHERE condition.
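Running the query through EXPLAIN again should show the effect: with the multi-column statistics in place, the retrieve step is typically estimated with high confidence (exact wording varies by release):

EXPLAIN
SELECT * FROM The_Table
WHERE
first_column = 'A'
AND second_column = 'B';

-- for example: "The size of Spool 1 is estimated with high confidence to be ... rows."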
There is another reason you should avoid heuristics:
Whenever at least one column’s cardinality is estimated with heuristics, the complete retrieve step ends up with “no confidence”, which takes optimization possibilities away from the Optimizer (a conservative execution plan will be created).
I hope you enjoyed it, and feel free to ask any questions if something is missing or unclear.