Teradata Statistics Extrapolation for Equality Predicates

2
185

teradata statistics extrapolationThe object use count feature (OUC), which is available since Teradata 14.10, brought some further improvements for statistics extrapolation.

In Teradata 13.10 the Optimizer uses random-AMP sampling to detect data growth, which can be enough for tables without skew.

The weak point of this approach shows up as soon as you insert and delete rows without recollecting statistics. The Teradata 13.10 Optimizer extrapolates with heuristics to avoid underestimation of row counts in such cases.

The below test scenario proves the benefit of improvements introduced with Teradata 14.10.

Here are the details of our test setup. It’s used on both Teradata versions, 13.10 and 14.10:

CREATE TABLE TheDB.TheTable
(
PK INTEGER NOT NULL,
TheCol CHAR(01)
) PRIMARY INDEX (PK);

INSERT INTO TheDB.TheTable
SELECT
ROW_NUMBER() OVER (ORDER BY 1) AS PK,
TRIM(CASE WHEN RANDOM(1,1000) <= 500 THEN ‘A’ ELSE ‘B’ END) AS TheCol,
FROM
(
SELECT NULL AS X FROM TheDB.AnyBigTable SAMPLE 200000
) x;

COLLECT STATISTICS ON TheDB.TheTable COLUMN(PK);
COLLECT STATISTICS ON TheDB.TheTable COLUMN(TheCol);

 

Teradata Statistics Extrapolation – Happens always with 13.10

At first we are executing below query on Teradata 13.10, to show the behaviour up to release 13.10:

EXPLAIN SELECT * FROM TheDB.TheTable  WHERE TheCol  = ‘C’;

3) We do an all-AMPs RETRIEVE step from TheDB.TheTable by way of
an all-rows scan with a condition of (“TheDB.TheTable.TheCol =  ‘C ‘”) into Spool 1 (group_amps),
which is built locally on the   AMPs.  The size of Spool 1 is estimated with high confidence to be  20,000 rows (42,580,000 bytes).
The estimated time for this step is 0.31 seconds.

 

We are selecting a value which is not available in column TheCol (‘C’). Still, the Optimizer extrapolates and estimates 20,000 rows (10% of the table rows).

Why is the Optimizer doing this? To avoid the underestimation of row counts in the case of

  • Inserts and deletes
  • Without refreshing the statistics afterward

Still, the estimation is wrong: the real number of selected rows is 0.

Statistics Extrapolation on Teradata 14.10 – only if needed!

Luckily, starting with Teradata 14.10, object use count (OUC) can be utilized by the Optimizer to detect data changes since the last statistics collection.

The Optimizer queries the object use count (OUC). It avoids statistics extrapolation for unchanged tables. A vast improvement!

We repeat the execution of above SQL select statement but this time on Teradata 14.10. Now it has the following execution plan and estimations:

3) We do an all-AMPs RETRIEVE step from TheDB.TheTable by way of
an all-rows scan with a condition of (“TheDB.TheTable.TheCol = ‘C ‘”) into Spool 1 (group_amps),
which is built locally on the AMPs.  The size of Spool 1 is estimated with high confidence to be 1 row (1,079 bytes).
The estimated time for this step is 0.04 seconds.

 

Notice that the Optimizer has “high confidence” on both Teradata versions. Still, the estimation on Teradata 14.10 is better (correct) because of OUC. Teradata 13.10 relies on the inaccurate heuristics to estimate changes done since the last statistics collection.

See also:
Don’t get fooled by the Statistic Estimations
Teradata Statistics – A Case Study

Our Reader Score
[Total: 6    Average: 5/5]
Teradata Statistics Extrapolation for Equality Predicates written by Roland Wenzlofsky on June 30, 2015 average rating 5/5 - 6 user ratings

2 COMMENTS

  1. how come the seocd time with high confiden e with one row where as there is now for the result infact?
    ideally it shoulf say zero rows or no rows returned rite?

    • You are right. Nevertheless, usually the result is at least estimated to be one row. I don’t know why but I assume (and I am guessing here) that the reasons might be:

      (1)

      Division by zero error handling when creating the execution plan can be avoided.

      (2)

      It makes calculations much easier if at least 1 row is assumed. For example:
      How to estimated the percentage for 0 rows being returned? Assuming one row simplifies things a lot.

      But as i said, I am just guessing here and this are the reasons I could think about.

      Maybe somebody working at R & D in Teradata is reading this and has a better answer?

      Roland

LEAVE A REPLY

Please enter your comment!
Please enter your name here