June 30

2 comments

Teradata Statistics Extrapolation

By Roland Wenzlofsky

June 30, 2015

object use count, statistics

The object use count feature (OUC), which is available since Teradata 14.10, brought some further improvements for statistics extrapolation.

In Teradata 13.10 the Optimizer uses random-AMP sampling to detect data growth, which can be enough for tables without skew.

The weak point of this approach shows up as soon as you insert and delete rows without recollecting statistics. The Teradata 13.10 Optimizer extrapolates with heuristics to avoid underestimation of row counts in such cases.

The below test scenario proves the benefit of improvements introduced with Teradata 14.10.

Here are the details of our test setup. It's used on both Teradata versions, 13.10 and 14.10:

CREATE TABLE TheDB.TheTable
(
PK INTEGER NOT NULL,
TheCol CHAR(01)
) PRIMARY INDEX (PK);

INSERT INTO TheDB.TheTable
SELECT
ROW_NUMBER() OVER (ORDER BY 1) AS PK,
TRIM(CASE WHEN RANDOM(1,1000) <= 500 THEN ‘A' ELSE ‘B' END) AS TheCol,
FROM
(
SELECT NULL AS X FROM TheDB.AnyBigTable SAMPLE 200000
) x;

COLLECT STATISTICS ON TheDB.TheTable COLUMN(PK);
COLLECT STATISTICS ON TheDB.TheTable COLUMN(TheCol);

Teradata Statistics Extrapolation – Happens always with 13.10

At first we are executing below query on Teradata 13.10, to show the behaviour up to release 13.10:

EXPLAIN SELECT * FROM TheDB.TheTable  WHERE TheCol  = ‘C';

3) We do an all-AMPs RETRIEVE step from TheDB.TheTable by way of
an all-rows scan with a condition of (“TheDB.TheTable.TheCol =  ‘C ‘”) into Spool 1 (group_amps),
which is built locally on the   AMPs.  The size of Spool 1 is estimated with high confidence to be  20,000 rows (42,580,000 bytes).
The estimated time for this step is 0.31 seconds.

We are selecting a value which is not available in column TheCol (‘C’). Still, the Optimizer extrapolates and estimates 20,000 rows (10% of the table rows).

Why is the Optimizer doing this? To avoid the underestimation of row counts in the case of

  • Inserts and deletes
  • Without refreshing the statistics afterward

Still, the estimation is wrong: the real number of selected rows is 0.

Statistics Extrapolation on Teradata 14.10 – only if needed!

Luckily, starting with Teradata 14.10, object use count (OUC) can be utilized by the Optimizer to detect data changes since the last statistics collection.

The Optimizer queries the object use count (OUC). It avoids statistics extrapolation for unchanged tables. A vast improvement!

We repeat the execution of above SQL select statement but this time on Teradata 14.10. Now it has the following execution plan and estimations:

3) We do an all-AMPs RETRIEVE step from TheDB.TheTable by way of
an all-rows scan with a condition of (“TheDB.TheTable.TheCol = ‘C ‘”) into Spool 1 (group_amps),
which is built locally on the AMPs.  The size of Spool 1 is estimated with high confidence to be 1 row (1,079 bytes).
The estimated time for this step is 0.04 seconds.

Notice that the Optimizer has “high confidence” on both Teradata versions. Still, the estimation on Teradata 14.10 is better (correct) because of OUC. Teradata 13.10 relies on the inaccurate heuristics to estimate changes done since the last statistics collection.

See also:
Don't get fooled by the Statistic Estimations
Teradata Statistics – A Case Study

Roland Wenzlofsky


Roland Wenzlofsky is a graduated computer scientist and Data Warehouse professional working with the Teradata database system for more than 20 years. He is experienced in the fields of banking and telecommunication with a strong focus on performance optimization.

You might also like

  • how come the second time with high confidence e with one row whereas there is now for the result in fact?
    ideally, it should say zero rows or no rows returned right?

    Reply

    • You are right. Nevertheless, usually, the result is at least estimated to be one row. I don’t know why but I assume (and I am guessing here) that the reasons might be:

      (1)

      Division by zero error handling when creating the execution plan can be avoided.

      (2)

      It makes calculations much easier if at least 1 row is assumed. For example:
      How to estimated the percentage for 0 rows being returned? Assuming one row simplifies things a lot.

      But as I said, I am just guessing here and this is the reason I could think about it.

      Maybe somebody working at R & D in Teradata is reading this and has a better answer?

      Roland

      Reply

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    Never miss a good story!

     Subscribe to our newsletter to keep up with the latest trends!

    >