The Teradata Summary Statistics

Roland Wenzlofsky

June 30, 2022

minutes reading time


Teradata summary statistics are introduced with Teradata Release 14 and provide the Optimizer with statistics about the number of rows in a table. Teradata summary statistics are quite useful for performance tuning as they can be collected efficiently and the number of rows is the most important metric for creating the execution plan. They can therefore be used explicitly wherever collecting statistics on columns was too costly in terms of resources. The syntax is very simple:

COLLECT SUMMARY STATISTICS ON <TABLE>;

Teradata summary statistics include the number of rows per table, the average data block size, and the average row size. The number of rows per table is a critical measure for the Teradata Optimizer to estimate the cost of a full table scan.

In older versions of Teradata where there were no summary statistics, you had to collect statistics on the dummy column “PARTITION” and the primary index columns.

Summary statistics that are defined on a table will be refreshed each time the statistics on any table column are updated. If we want to avoid wasting resources, the best practice is to update all column statistics at once instead of column by column as shown below:

COLLECT STATISTICS ON <TABLE>; -- all column statistics and summary statistics are refreshed. 

Although less intense on resources than traditional column/index statistics, this should still be avoided:

COLLECT STATISTICS ON <TABLE> <COLUMN_1>; -- first collection of summary statistics
...
COLLECT STATISTICS ON <TABLE> <COLUMN_N>; -- needless nth collection of summary statistics

Teradata Summary Statistics are also suitable as a quick initial solution to enable the collection of statistics at the table level. Let’s say we want to refresh statistics on a table that does not yet have a single column statistic; we will receive the below error:

COLLECT STATISTICS ON <TABLE>;
COLLECT failed. [3624] There are no statistics defined for the table.

Creating initial SUMMARY statistics on all tables is fast and efficient and this problem no longer exists:

COLLECT SUMMARY STATISTICS ON <TABLE>;
COLLECT STATISTICS ON <TABLE>;
Collect complete. 1 rows processed

  • Hi Sreeraj. I think this is a misunderstanding.

    “All columns at once” means to collect statistics with the new syntax which allows collecting the stats on different columns of the same table at once.

    Before Teradata 14 we had to collect stats column by column i.e. issue one statement per column. The only performance advantage previously possible was synchronized scanning on the spool (this was achieved by starting the collect statistics statements for one table at the same time).

    “All columns at once” does not mean to collect statistics on each and every column of a table…

    Roland

  • Thank you. very informative.
    Question: for TD 14 onwards, the best practice is to collect all column statistics at once instead of column by column. is this true for big tables as well? I was under the impression that we need to collect stats only for certain columns based on the PI/join and WHERE. could you please clarify?

  • Very informative post.Really thank you! Awesome.

  • Avatar
    Dimitrios says:

    Hello,

    This new feature has only advantages, except from one thing, the way that we have to change the procedure which collects the stats in order to use this feature(s) (plus the max value length/max internal and sample ).

    It is also recommended to big tables, where collecting stats consumes a lot of io/CPU, just
    Collect Summary Stats on DB.TB, after every DML transaction on it, in order to have updated summary stats on the table, in this way we help the parser to make better extrapolations.

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >