What is Teradata Statistics?

Statistics provide the Optimizer with the data demographics it needs to generate good execution plans. The Optimizer creates several candidate plans and executes the one with the lowest estimated resource cost.

All cost based optimizers use statistical information about data and hardware.

Teradata uses statistical information about tables (summary statistics) and columns. Table metrics include, for example, the number of rows, the average row size, and the number of rows per data block.

Column statistics are more detailed than summary statistics: the number of rows per value, the number of distinct values, and the lowest and highest value. They also include information about the most common column values, which the Optimizer uses for join planning.
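As a sketch of how both kinds of statistics are gathered (table and column names here are hypothetical), the COLLECT STATISTICS statement covers both cases:

```sql
-- Summary (table-level) statistics: row count, average row size, etc.
COLLECT SUMMARY STATISTICS ON Sales_Db.Transactions;

-- Column statistics: distinct values, rows per value, low/high value, ...
COLLECT STATISTICS COLUMN (Customer_Id) ON Sales_Db.Transactions;

-- Show which statistics exist on the table
HELP STATISTICS Sales_Db.Transactions;
```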

Statistics are used to improve the execution plans.

An execution plan consists of join, retrieve, and aggregation steps. Retrieve steps are caused by WHERE conditions, aggregation steps by GROUP BY clauses.

Statistics influence the order in which the Optimizer executes the plan steps and the technique it uses for each step.

The Optimizer uses statistics to answer the following questions:

Retrieve Steps:

Is it cheaper to use an index or a full table scan? The Optimizer will readily use a primary index, and it will also easily pick a unique secondary index. Statistics, however, help it decide whether to do a full table scan or use a nonunique secondary index (NUSI), and whether NUSI bit mapping should be applied.
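A sketch of how this decision becomes visible (table, index, and column names are hypothetical): after collecting statistics on a NUSI column, the EXPLAIN output reveals whether the Optimizer chooses the index subtable or a full table scan.

```sql
-- Nonunique secondary index on a hypothetical column
CREATE INDEX (Region_Code) ON Sales_Db.Transactions;

-- Without these statistics, the Optimizer has to guess the selectivity
COLLECT STATISTICS COLUMN (Region_Code) ON Sales_Db.Transactions;

-- The plan text shows either NUSI access ("by way of index # ...")
-- or a full table scan ("by way of an all-rows scan")
EXPLAIN
SELECT *
FROM Sales_Db.Transactions
WHERE Region_Code = 'EMEA';
```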

Join Steps:

Which join type should be used (merge join, product join, hash join, etc.)? When two rows are joined, they must be located on the same AMP; this is only guaranteed upfront if the join columns are the primary index of both tables. Otherwise, the Optimizer has to find the cheapest way to bring the rows together: for example, the rows of one table can be duplicated to all AMPs, or both tables can be redistributed by the row hash of the join columns. Last but not least, the Optimizer has to decide on the ideal join order.
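To give the Optimizer reliable input for these decisions, statistics are typically collected on the join columns of both tables (all names in this sketch are hypothetical):

```sql
COLLECT STATISTICS COLUMN (Customer_Id) ON Sales_Db.Transactions;
COLLECT STATISTICS COLUMN (Customer_Id) ON Sales_Db.Customers;

-- The EXPLAIN output shows the chosen geography (rows duplicated or
-- redistributed by row hash) and the join type (merge, hash, product, ...)
EXPLAIN
SELECT t.Transaction_Id, c.Customer_Name
FROM Sales_Db.Transactions t
JOIN Sales_Db.Customers c
  ON t.Customer_Id = c.Customer_Id;
```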

Above, you saw a few examples of how statistics influence the execution plan.

Statistics allow the Optimizer to choose riskier but potentially cheaper execution plans because they increase its confidence in its estimations.

As an example, consider the hash join: it performs great if the smaller table can be kept entirely in memory, but performs badly if the memory is not sufficient.

Without collected statistics, the Optimizer might avoid the hash join to mitigate the risk of bad performance.

Collected statistics are only the starting point for estimations. They are adjusted throughout the optimization process as more information becomes available from database constraints such as CHECK constraints, referential integrity constraints, and query predicates. The adjusted estimations are called the “derived statistics”.

The Optimizer further refines the derived statistics each time it gets new details from:

– A join step
– A single table predicate (a WHERE condition)
– An aggregation step

Here is one example:

Assume we have a WHERE condition like this one: “WHERE column IN (1,2,3)”. The Optimizer derives that the result set contains at most 3 distinct values for this column and uses this enhanced piece of information from now on to improve the estimations of the subsequent execution steps.
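A minimal sketch (hypothetical table and column): comparing the estimated row counts in the EXPLAIN output with and without collected statistics on the predicate column makes this derivation visible.

```sql
COLLECT STATISTICS COLUMN (Status_Code) ON Sales_Db.Transactions;

-- The estimated row count shown in the plan now reflects that the
-- result set can contain at most 3 distinct values of Status_Code
EXPLAIN
SELECT *
FROM Sales_Db.Transactions
WHERE Status_Code IN (1, 2, 3);
```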

Without collected statistics, the Optimizer relies solely on random-AMP sampling when building the execution plan. This approach gives good estimations if the data is evenly distributed (unique or nearly unique columns), but it results in wrong estimates for skewed data.
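If full statistics are too expensive to collect on a very large table, sampled statistics are a middle ground between full collection and pure random-AMP sampling (syntax sketch, hypothetical table):

```sql
-- Read only a sample of the rows instead of scanning the whole table
COLLECT STATISTICS USING SAMPLE 10 PERCENT
  COLUMN (Customer_Id) ON Sales_Db.Transactions;
```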

Teradata Statistics Basics written by Roland Wenzlofsky on January 28, 2017

5 COMMENTS

  1. Hi DWH Pro,
    Thanks so much for your answers.
    Please let me ask another question about the join steps case.
    If multiple columns are used as join conditions, do statistics collected on each separate column help the Optimizer decide,
    or only multi-column statistics? Or do both help?

  2. If no collected statistics are available on the GROUP BY columns, the Optimizer will apply heuristic estimations. Taking the estimated number of rows of an aggregation from collected statistics gives the Optimizer a better chance to create a good execution plan.

    • Thanks a lot. I got the point.
      I know that statistics collected on aggregation columns help the Optimizer estimate the number of result rows and decide further steps.
      But if it is just a simple select without a further query using the derived results, will statistics still benefit the query?

      For example, if I just run the query “select col_a, count(*) from TBLname group by 1”, will it be faster if I collect stats on col_a? (Will the system give more resources to the query if the Optimizer knows the counts accurately?)

      Regards

      • Yes, statistics can definitely even make a difference in your simple example. The Optimizer might recognize, for example, that a scan of an existing NUSI subtable is cheaper (from an I/O point of view) than a full table scan of the base table.

        Best Regards,
        Roland

  3. As mentioned, retrieve steps are caused by WHERE conditions and aggregation steps by GROUP BY.
    I wonder how aggregations benefit from statistics and how to collect stats to improve aggregations.
