Optimizing Performance with Teradata Summary Statistics

Roland Wenzlofsky

April 20, 2023

minutes reading time


Teradata summary statistics were introduced in Release 14 to give the Optimizer vital information on table row counts. These statistics are valuable for performance tuning as they can be efficiently gathered and are crucial in creating an execution plan. They are especially useful when collecting column statistics requires significant resources. The syntax for collecting summary statistics is simple:

COLLECT SUMMARY STATISTICS ON <TABLE>;

Teradata summary statistics include the number of rows, average data block size, and average row size per table. The Teradata Optimizer heavily relies on the number of rows per table to approximate the cost of scanning a table, thus emphasizing its criticality.

In previous Teradata versions lacking summary statistics, gathering statistics on the “PARTITION” dummy column and primary index columns was imperative.

When defining summary statistics for a table, refreshing them occurs whenever any column statistics are updated. To optimize resource usage, it is recommended to update all column statistics at once, as exemplified below:

COLLECT STATISTICS ON <TABLE>; -- all column statistics and summary statistics are refreshed. 

Although less resource-intensive than traditional column/index statistics, one should still refrain from utilizing the following approach:

COLLECT STATISTICS ON <TABLE> <COLUMN_1>; -- first collection of summary statistics
...
COLLECT STATISTICS ON <TABLE> <COLUMN_N>; -- needless nth collection of summary statistics

Teradata summary statistics serve as a great starting point for gathering table-level statistics. When trying to update table statistics for a table lacking any column statistics, an error will arise:

COLLECT STATISTICS ON <TABLE>;
COLLECT failed. [3624] There are no statistics defined for the table.

Efficiently generating preliminary summary statistics for all tables can resolve this issue.

COLLECT SUMMARY STATISTICS ON <TABLE>;
COLLECT STATISTICS ON <TABLE>;
Collect complete. 1 rows processed

Teradata Statistics FAQ

What are Teradata statistics used for?

Teradata statistics are essential for the optimizer to generate efficient and accurate query execution plans. By collecting statistics on tables, columns, and indexes, the optimizer can better understand the data distribution, cardinality, and uniqueness within the database. This information helps the optimizer make informed decisions about join methods, access paths, and resource allocation when processing queries. In essence, Teradata statistics are crucial in improving query performance and ensuring optimal system resource usage.

How to show statistics in Teradata?

To show statistics in Teradata, you can use the HELP STATISTICS command. This command displays the available statistics for a specific table, index, or column. The syntax for the HELP STATISTICS command is as follows:

HELP STATISTICS database_name.table_name;

Replace database_name and table_name with the appropriate database and table names for which you want to display the statistics. You can also specify a particular column or index by appending it to the table name like this:

HELP STATISTICS database_name.table_name.column_name;

How often do you collect statistics in Teradata?

The frequency of collecting statistics in Teradata depends on various factors, such as the nature of the data, the rate of data changes, and the impact on query performance. As a general guideline, you should consider collecting statistics when:

  1. A table is newly created or loaded with data.
  2. A significant amount of data has been added, updated, or deleted (typically around 10% or more of the table’s total data).
  3. You observe a decline in query performance or the optimizer starts generating suboptimal execution plans.

You may need to collect statistics more often for tables with frequently changing data. Conversely, you can collect statistics less frequently for tables with relatively static data.

Why do we collect stats in Teradata?

Collecting statistics in Teradata is essential for maintaining optimal query performance and resource utilization. By gathering information on data distribution, cardinality, and uniqueness, the optimizer can make informed decisions when creating query execution plans. Collecting statistics helps the optimizer:

  1. Choose the most efficient join methods and access paths for the given data.
  2. Estimate the number of rows each operation will process and return, allowing for better resource allocation and parallelism.
  3. Determine the cost of different execution plan options, leading to selecting the most efficient plan.

In summary, collecting stats in Teradata is critical for ensuring efficient query execution and maximizing system performance.

  • Hi Sreeraj. I think this is a misunderstanding.

    “All columns at once” means to collect statistics with the new syntax which allows collecting the stats on different columns of the same table at once.

    Before Teradata 14 we had to collect stats column by column i.e. issue one statement per column. The only performance advantage previously possible was synchronized scanning on the spool (this was achieved by starting the collect statistics statements for one table at the same time).

    “All columns at once” does not mean to collect statistics on each and every column of a table…

    Roland

  • Thank you. very informative.
    Question: for TD 14 onwards, the best practice is to collect all column statistics at once instead of column by column. is this true for big tables as well? I was under the impression that we need to collect stats only for certain columns based on the PI/join and WHERE. could you please clarify?

  • Very informative post.Really thank you! Awesome.

  • Avatar
    Dimitrios says:

    Hello,

    This new feature has only advantages, except from one thing, the way that we have to change the procedure which collects the stats in order to use this feature(s) (plus the max value length/max internal and sample ).

    It is also recommended to big tables, where collecting stats consumes a lot of io/CPU, just
    Collect Summary Stats on DB.TB, after every DML transaction on it, in order to have updated summary stats on the table, in this way we help the parser to make better extrapolations.

  • {"email":"Email address invalid","url":"Website address invalid","required":"Required field missing"}

    You might also like

    >