Teradata Heuristics – How Skew improves the Estimations

Roland Wenzlofsky

June 16, 2015

minutes reading time


Missing statistics on indexed columns most times is not a big deal. The Optimizer substitutes them with random-AMP sampled statistics. If tables are not heavily skewed, sampled estimations will be enough to create a good execution plan.

But if you are omitting statistics for non-indexed WHERE condition columns, workload performance will be suboptimal or wrong.

This article discusses Optimizer’s weak approach to estimating cardinality for non-indexed columns.

Teradata Heuristics – A Nasty Surprise

If there are no collected statistics on non-indexed WHERE condition columns, an elementary method for estimating the Optimizer uses the number of rows:

If the WHERE conditions are combined with AND, the below formula is applied:

SELECT * FROM The_Table
WHERE
first_column = ‘A’                  — 10% of the table cardinality
AND second_column = ‘B’   — 75% of above estimation
AND third_column  =’ C’ — 75% of the previous estimation
…

AND nth column = ‘X’;           — 75% of the estimation of column n-1

The estimation is: ((estimated number of rows in table) * 0,1) * ( 0,75 * 0,75 * … * 0,75) rows

If the WHERE conditions are combined with OR, this formula is applied:

SELECT * FROM The_Table
WHERE
first_column = ‘A’                   — 10% of the table cardinality
OR  second_column = ‘B’      — 10% of the table cardinality
OR third_column  =’C’           — 10% of the table cardinality
…

OR nth column = ‘X’;              — 10% of the table cardinality

The estimation is: (estimated number of rows in the table) * 0.1 * n

Remark: The table cardinality (the number of table rows) is always based on the collected or random-AMP statistics for the primary index.

This is the nasty surprise when dealing with heuristics:

It makes no difference if your query selects millions of rows or no row at all: the estimations are stupid and most times either too high or too low.

A short remark about the estimation of columns joined with OR:

The rule which is described above is implemented in Teradata 14.10. Previous releases of Teradata decrease above estimates by 1% for each intersection (set theory), such as:

A OR B: 1 intersection, the estimation is decreased by 1%
A OR B OR C: 3 intersections, the estimation is decreased by 3%

Intersect

Altogether: Heuristics are a bad way to estimate the cardinality of retrieve steps.

The Optimizer’s Toolkit to improve Estimations

Luckily, the Optimizer can make use of additional information to improve estimations for non-indexed WHERE condition columns:

If at least one column with collected statistics is accessed in the WHERE condition, this column’s estimation will be used as a starting point, instead of the inaccurate 10% rule. For the remaining columns, still, the heuristic approach is applied:

COLLECT STATISTICS ON The_Table COLUMN first_column;

We assume that the statistics on column first_column determine the number of rows to be 5,000.
SELECT * FROM The_Table
WHERE
first_column = ‘A’                  — 5,000
AND second_column = ‘B’   — 75% of above estimation

The estimation is: 3,750 rows

The above example shows how the Optimizer uses available statistics to improve the “starting point” for further estimations.

But what happens if we collect statistics on both columns (first_column and second_column)?

In this case, the Optimizer will choose – as a starting point – the column’s statistics, which has the highest selectivity, and apply the usual heuristics to the second column – without considering the second column’s collected statistics.

COLLECT STATISTICS ON The_Table COLUMN first_column; – 5,000 rows with value ‘A’

COLLECT STATISTICS ON The_Table COLUMN second_column;  – 200 rows with value ‘B’
SELECT * FROM The_Table
WHERE
first_column = ‘A’                  — 75% of below  estimation
AND second_column = ‘B’   —  200 rows

The estimation is: 150 rows

It becomes even better:

The Twisted World of the Optimizer: How Skew improves the Estimations

The Optimizer uses the skew information it has about column first_column to make the estimation even more accurate:

Here is the “skew rule” the Optimizer applies:

Suppose the column first_column is available in more than 75% of the rows (because the statistics histogram stores biased values). In that case, the 75% estimation will be replaced with a better estimation, namely, the real number of rows (the frequency of the skewed value).

Here is again our example from above. Additionally, we assume that 90% of all rows contain the value ‘A’ in column first_column. It’s skewed. The histograms for biased values can be used:

COLLECT STATISTICS ON The_Table COLUMN first_column; – 5,000 rows with value ‘A’

COLLECT STATISTICS ON The_Table COLUMN second_column;  – 200 rows with value ‘B’
SELECT * FROM The_Table
WHERE
first_column = ‘A’                  — 90% of below  estimation
AND second_column = ‘B’   —  200 rows

The estimation is: 180 rows

By using the information about biased values, the estimations become better.

If you are wondering how even better estimations could be achieved (after learning that the Optimizer will at best use the collected statistics of exactly one WHERE condition column):

The solution is Multi-column statistics.

COLLECT STATISTICS ON The_Table COLUMN (first_column,second_column);
SELECT * FROM The_Table
WHERE
first_column = ‘A’
AND second_column = ‘B’;

By adding multi-column statistics, the Optimizer does not depend anymore on heuristics.

There is another reason you should avoid heuristics:

Whenever at least one column’s cardinality is estimated with heuristics, the complete retrieve steps end up with “no confidence”, taking optimization possibilities from the Optimizer (a conservative execution plan will be created).

I hope you enjoyed it, and feel free to ask any questions if something is missing or unclear.

  • Hi –

    It’s good stuff. Kindly clarify the below one,

    Multi-column stats are defined and a few columns are part of the SQL. How the estimation will happen there?

    COLLECT STATISTICS ON The_Table COLUMN (first_column,second_column);

    SELECT * FROM The_Table
    WHERE
    first_column = ‘A’;

    Thanks

    • Very good explanation, thank you!

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >