Missing statistics on indexed columns are usually not a big deal. The Optimizer substitutes them with random-AMP sampled statistics. As long as tables are not heavily skewed, the sampled estimates are good enough to create a decent execution plan.

But if you omit statistics on non-indexed WHERE condition columns, workload performance will be suboptimal at best.

This article discusses the weak approach the Optimizer uses to estimate cardinality for non-indexed columns.

## Teradata Heuristics – A Nasty Surprise

If there are no collected statistics on non-indexed WHERE condition columns, the Optimizer falls back on a very basic method to estimate the number of rows:

If the WHERE conditions are combined with AND, the formula below is applied:

```sql
SELECT * FROM The_Table
WHERE
    first_column  = 'A'  -- 10% of the table cardinality
AND second_column = 'B'  -- 75% of the above estimation
AND third_column  = 'C'  -- 75% of the previous estimation
...
AND nth_column    = 'X'; -- 75% of the estimation for column n-1
```

The estimation is: (estimated number of rows in the table) * 0.1 * 0.75^(n-1) rows
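As a sketch, the AND heuristic can be written in Python; the function name and the example table size are illustrative, not part of any Teradata API:

```python
def and_heuristic(table_rows, n_conditions):
    """Heuristic estimate for n AND-combined, non-indexed WHERE columns:
    10% of the table cardinality for the first column, then 75% of the
    previous estimate for every further column."""
    estimate = table_rows * 0.10          # first column: 10% rule
    for _ in range(n_conditions - 1):     # each additional AND condition
        estimate *= 0.75                  # 75% of the previous estimate
    return estimate

# 1,000,000 rows, 3 AND-combined conditions:
# 1,000,000 * 0.10 * 0.75 * 0.75 -> approximately 56,250 rows
print(and_heuristic(1_000_000, 3))
```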

If the WHERE conditions are combined with OR, this formula is applied:

```sql
SELECT * FROM The_Table
WHERE
   first_column  = 'A'  -- 10% of the table cardinality
OR second_column = 'B'  -- 10% of the table cardinality
OR third_column  = 'C'  -- 10% of the table cardinality
...
OR nth_column    = 'X'; -- 10% of the table cardinality
```

The estimation is: (estimated number of rows in the table) * 0.1 * n rows
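The OR heuristic is a simple multiplication; again a minimal Python sketch with an illustrative table size:

```python
def or_heuristic(table_rows, n_conditions):
    """Heuristic estimate for n OR-combined, non-indexed WHERE columns:
    each condition contributes 10% of the table cardinality."""
    return table_rows * 0.10 * n_conditions

# 1,000,000 rows, 3 OR-combined conditions -> approximately 300,000 rows
print(or_heuristic(1_000_000, 3))
```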

Remark: The table cardinality (the number of rows in the table) is always derived from the collected statistics or the random-AMP sampled statistics on the primary index.

This is the nasty surprise when dealing with heuristics: the percentages are fixed and completely independent of the actual data.

A short remark about the estimation of columns combined with OR: the rule described above is implemented in Teradata 14.10. Earlier releases of Teradata decrease the above estimation by 1% for each intersection (in the set-theory sense), such as:

A OR B: 1 intersection, the estimation is decreased by 1%

A OR B OR C: 3 intersections, the estimation is decreased by 3%
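From the two examples above (1 intersection for two columns, 3 for three), the intersection count matches the number of pairwise intersections, C(n, 2). That mapping is an interpretation drawn only from these two data points, not documented behavior; the sketch below encodes it:

```python
import math

def or_heuristic_pre_1410(table_rows, n_conditions):
    """Pre-14.10 variant of the OR heuristic: the estimate is decreased
    by 1% per intersection. Assumption: the intersection count equals
    the number of pairwise intersections C(n, 2), which reproduces the
    article's examples (n=2 -> 1%, n=3 -> 3%)."""
    base = table_rows * 0.10 * n_conditions
    intersections = math.comb(n_conditions, 2)
    return base * (1 - 0.01 * intersections)

# Two OR conditions on a 1,000,000-row table: 200,000 * 0.99 = 198,000 rows
print(or_heuristic_pre_1410(1_000_000, 2))
```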

Altogether: Heuristics are a bad way to estimate the cardinality of retrieve steps.

## The Optimizer’s Toolkit to Improve Estimations

Luckily, the Optimizer can make use of additional information to improve estimations for non-indexed WHERE condition columns:

If at least one column with **collected statistics** is accessed in the WHERE condition, this column’s estimation is used as the starting point instead of the inaccurate 10% rule. For the remaining columns, the heuristic approach is still applied:

```sql
COLLECT STATISTICS ON The_Table COLUMN first_column;
```

We assume that the statistics on column first_column show 5,000 rows with the value ‘A’.

```sql
SELECT * FROM The_Table
WHERE
    first_column  = 'A'  -- 5,000 rows (from the collected statistics)
AND second_column = 'B'; -- 75% of the above estimation
```

The estimation is: 5,000 * 0.75 = 3,750 rows
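The combination of one collected statistic plus heuristics for the rest can be sketched as follows (function name is illustrative):

```python
def estimate_with_one_stat(stat_rows, n_remaining):
    """Starting point from the collected statistics of one column,
    then the 75% heuristic for each remaining AND-combined column
    without statistics."""
    return stat_rows * 0.75 ** n_remaining

# 5,000 rows from statistics, one further AND condition -> 3,750 rows
print(estimate_with_one_stat(5000, 1))
```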

The above example shows how the Optimizer uses available statistics to improve the “starting point” for its further estimations.

But what happens if we collect statistics on both columns (first_column and second_column)?

In this case, the Optimizer will choose, as a starting point, the statistics of the column with the higher selectivity, and apply the usual heuristics to the other column without considering its collected statistics.

```sql
COLLECT STATISTICS ON The_Table COLUMN first_column;  -- 5,000 rows with value 'A'
COLLECT STATISTICS ON The_Table COLUMN second_column; -- 200 rows with value 'B'

SELECT * FROM The_Table
WHERE
    first_column  = 'A'  -- 75% of the estimation below
AND second_column = 'B'; -- 200 rows (the more selective statistics)
```

The estimation is: 200 * 0.75 = 150 rows
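A minimal sketch of this selection rule (picking the smaller, i.e. more selective, row count as the starting point; names are illustrative):

```python
def estimate_two_stats(stat_rows_a, stat_rows_b):
    """When both AND-combined columns have collected statistics, the
    Optimizer starts from the more selective one (the smaller row
    count) and applies the 75% heuristic to the other column, ignoring
    that column's statistics."""
    return min(stat_rows_a, stat_rows_b) * 0.75

# first_column: 5,000 rows, second_column: 200 rows -> 200 * 0.75 = 150 rows
print(estimate_two_stats(5000, 200))
```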

It becomes even better:

## The Twisted World of the Optimizer: How Skew improves the Estimations

The Optimizer uses the skew information it has about column first_column to make the estimation even more accurate.

Here is the “skew rule” the Optimizer applies:

If the queried value of first_column is present in more than 75% of the rows (which is known because the statistics histogram stores biased values), the 75% heuristic is replaced with a better estimate: the actual number of rows (the frequency of the skewed value).

Here is our example from above again. Additionally, we assume that 90% of all rows contain the value ‘A’ in column first_column. Since the column is skewed, the histogram’s biased values can be used:

```sql
COLLECT STATISTICS ON The_Table COLUMN first_column;  -- 5,000 rows with value 'A' (90% of all rows)
COLLECT STATISTICS ON The_Table COLUMN second_column; -- 200 rows with value 'B'

SELECT * FROM The_Table
WHERE
    first_column  = 'A'  -- 90% of the estimation below
AND second_column = 'B'; -- 200 rows
```

The estimation is: 200 * 0.90 = 180 rows
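The skew rule amounts to replacing the fixed 75% factor with the biased value's real frequency whenever that frequency exceeds 75%. A sketch under that reading (function name illustrative):

```python
def estimate_with_skew(stat_rows_other, skew_fraction):
    """Skew rule: if the histogram shows the queried value as a biased
    value covering more than 75% of the rows, that real frequency
    replaces the 75% heuristic; otherwise the 75% heuristic stands."""
    factor = skew_fraction if skew_fraction > 0.75 else 0.75
    return stat_rows_other * factor

# second_column statistics: 200 rows; 'A' covers 90% of first_column
# -> 200 * 0.90 = 180 rows
print(estimate_with_skew(200, 0.90))
```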

By making use of the information about biased values, the estimations become better.

If you are wondering how even better estimations could be achieved (given that the Optimizer will, at best, use the collected statistics of exactly one WHERE condition column): the solution is multi-column statistics.

```sql
COLLECT STATISTICS ON The_Table COLUMN (first_column, second_column);

SELECT * FROM The_Table
WHERE
    first_column  = 'A'
AND second_column = 'B';
```

With multi-column statistics in place, the Optimizer no longer depends on heuristics.

There is another reason you should avoid heuristics:

Whenever the cardinality of at least one column is estimated with heuristics, the whole retrieve step ends up with “no confidence”, which deprives the Optimizer of optimization possibilities (a conservative execution plan will be created).

I hope you enjoyed it and feel free to ask any questions if something is missing or unclear.
