In this article, we explain what collected Teradata Statistics are and what we have to pay attention to when we use them. We also give concrete tips on when to use collected statistics and what alternatives are available.
Teradata uses a cost-based optimizer and requires different metrics to assess the cost of a query, create different execution plans, and finally use the best plan. Statistics is one of the most important key figures and is information about the data demography of tables and indexes.
How the Optimizer uses Statistics
Execution Plans consist of individual steps that are executed sequentially or in parallel. For retrieve steps, statistics help the optimizer to choose the right access path (index or full table scan), but primary index access is always the first choice, even without statistics, as it is the most efficient.
Furthermore, statistics help the optimizer to select the optimal join strategy. This includes the order of the joins as well as the join type (merge jon, product join, etc.) and how the data is distributed to the AMPs (if necessary).
A join preparation step can duplicate rows to all AMPs or redistribute by them by the row hash of the join columns. Some join types require the tables to be sorted by row hash. Each of these preparation steps causes costs that have to weighted against each other. Statistics support the optimizer in this decision.
Statistics make the optimizer create more efficient but riskier execution plans.
The hash join, for example, performs best if the smaller table of a join can be kept entirely in memory, but has suboptimal performance if this is not the case. Without statistics, the optimizer might avoid the hash join to mitigate the risk of a bad plan.
What are Collected Statistics
Statistics in Teradata are information about data such as the number of table rows, the number of rows per column value, the number of distinct column values, the average row length, number of NULL values per column, the skew factor of a column, and much more. Statistics are used by the optimizer to generate the best possible execution plan. The optimizer uses them to determine which indexes are used, which join strategy is used and in which order joins are executed. Statistics can be collected on single columns, multiple columns, and entire tables. Table statistics give the optimizer different key figures than statistics on columns.
The definition and collection of required statistics are not automatic but must be defined by the user. Still, Teradata now offers a framework (called “Autostats feature”) to automatically create and schedule jobs that on the one hand determine which statistics the optimizer considers beneficial and on the other hand automate the collection of statistics. This framework is based on Viewpoint (as a scheduler), various DBC tables (for analysis), and procedures and can greatly simplify the creation and maintenance of statistics.
Collected statistics about columns are stored in intervals of histograms and as so-called summary statistics for information concerning the whole table. Teradata tries to store approximately the same number of different values in each interval. A column statistic can be a single column or a combination of columns. Intervals do not store the actual number of different values of a column but an approximation. For this purpose, the histogram is divided into 250 intervals by default and in each interval, the information about a range of values is stored in compressed form. The stored key figures are the maximum value of the interval, the most frequent value in the interval, and how often this value occurs, the number of different values in the interval (excluding the most frequent value), and how many rows contain these values. With 250 intervals, 0.4% of the rows in a table are in each interval.
The number of intervals can be reduced or increased. Multiple intervals make sense for skewed columns because several most frequent values can be stored. This can reduce the skew problem. Very skewed column values are not stored in the normal intervals but the number of rows for each value in the intervals for biased values. A histogram can therefore consist only of biased value intervals, only of equal-height intervals or a combination of both.
Teradata uses only the number of intervals that are really needed. If there are only 10 different values in a column, only 10 intervals will be created for the corresponding statistics; of course, we can reduce the number of intervals at any time, but this rarely makes sense.
The collection of statistics can be done for the whole table or a part of the table (to save resources). If only a part of the rows is used, this is called sample statistics.
Collected Statistics versus other Methods
We talk about Teradata’s dynamic AMP sampling and sample statistics in detail in other articles, we refer to add the end of this article. Should you miss the basics, please consider reading these articles first.
All considerations of how to collect statistics are based solely on one question: Are they sufficient to give the optimizer a sufficiently good picture of the data demography? As we can imagine, this is closely related to the skew factor of a table. Here are a few rules to help us make the right decision:
1. Small tables always need collected statistics.
Imagine the following scenario: Our Teradata system consists of 300 AMPs, and our table contains five rows. It is very likely that a dynamic AMP sample is taken from an AMP that does not contain rows. This will cause the optimizer to assume that the table is empty. This should be avoided; a product join could be the result. In any case, the execution plan will not be optimal. Since it is cheap to collect statistics on small tables you should not waste a second thinking about not doing this.
2. On a Unique Index (Primary Index, Unique Secondary Index) one can usually rely on dynamic AMP sampling since an even distribution of the data is given by definition. Alternatively one could collect sample statistics, this is to be considered for each case individually. The advantage of sample statistics could be that the optimizer can create a more aggressive (and hopefully better performing) execution plan. Complete statistics should be avoided if the table is large to avoid wasting resources.
Exception: if we are using non-equality constraints, we should always collect full statistics on them.
3. For indices that are not unique (NUPI, NUSI) the skew factor of the (index) table is again decisive if full statistics are needed. For NUSI the optimizer assumes that there is one subtable index row per value. If this is not the case, dynamic AMP sampling on the NUSI will give wrong results.
4. On non-indexed columns used in query conditions (where conditions) it is always advisable to collect either full or sample statistics (again, depending on whether sample statistics sufficiently represent the data demography). This prevents the optimizer from using heuristics.
Rules for Row Partitioned Tables
5. Collecting complete statistics on the dummy PARTITION column is an efficient way to give the optimizer information about empty partitions and the number of rows per partition. This information is used for the join costing of various join types (e.g. Sliding Window Merge Join).
6. We should always collect full or sample statistics on each partitioning column.
7. If the partition columns are not part of the Primary Index, we should additionally collect the following statistics:
PARTITION & Primary Index, as this tells the optimizer in how many partitions a particular primary index value is available (the optimizer uses this information in the case of dynamic partition elimination and sliding window merge joins).
PARTITION & Primary Index & each partitioning column: The optimizer uses these statistics to estimate costs for row key-based joins.
Rules for multi-column statistics
8. As the optimizer assumes independence for statistics collected on single columns, we should collect statistics on combinations of columns if there is a high correlation between them.