This article explains what collected Teradata Statistics are and what we must pay attention to when using them. We also give concrete tips on when to use collected statistics and what alternatives are available.
Teradata uses a cost-based optimizer and requires different metrics to assess the cost of a query, create other execution plans, and finally use the best plan. Statistics is one of the most important key figures and is information about the data demography of tables and indexes.
How the Optimizer uses Statistics
Execution Plans consist of individual steps that are executed sequentially or in parallel. For retrieve steps, statistics help the Optimizer choose the right access path (index or full table scan). Still, primary index access is always the first choice, even without statistics, as it is the most efficient.
Furthermore, statistics help the Optimizer to select the optimal join strategy. This includes the order of the joins, the join type (merge join, product join, etc.), and how the data is distributed to the AMPs (if necessary).
A join preparation step can duplicate rows to all AMPs or redistribute them by the row hash of the join columns. Some join types require the tables to be sorted by row hash. Each of these preparation steps causes costs that must be weighed against each other. Statistics support the Optimizer in this decision.
Statistics make the Optimizer create more efficient but riskier execution plans.
The hash join, for example, performs best if the smaller table of a join can be kept entirely in memory but has suboptimal performance if this is not the case. Without statistics, the Optimizer might avoid the hash join to mitigate the risk of a bad plan.
What are Collected Statistics
Statistics in Teradata are information about data such as the number of table rows, the number of rows per column value, the number of distinct column values, the average row length, the number of NULL values per column, and the skew factor of a column, and much more. The Optimizer uses statistics to generate the best possible execution plan. The Optimizer uses them to determine which indexes are used, which join strategy is used and in which order joins are executed. Statistics can be collected on single columns, multiple columns, and entire tables. Table statistics give the Optimizer different key figures than statistics on columns.
The definition and collection of required statistics are not automatic but must be defined by the user. Still, Teradata now offers a framework (called “Autostats feature”) to automatically create and schedule jobs that, on the one hand, determine which statistics the Optimizer considers beneficial and, on the other hand, automate the collection of statistics. This framework is based on Viewpoint (as a scheduler), various DBC tables (for analysis), and procedures and can significantly simplify the creation and maintenance of statistics.
Collected statistics about columns are stored in intervals of histograms and as so-called summary statistics for information concerning the whole table. Teradata tries to keep approximately the same number of different values in each interval. A column statistic can be a single column or a combination of columns. Intervals do not store the actual number of different column values but an approximation. For this purpose, the histogram is divided into 250 intervals by default, and in each interval, the information about a range of values is stored in compressed form. The stored key figures are the interval’s maximum value, the interval’s most frequent value, the number of different values in the interval (excluding the most frequent value), and how many rows contain these values. With 250 intervals, 0.4% of the rows in a table are in each interval.
The number of intervals can be reduced or increased. Multiple intervals make sense for skewed columns because several most frequent values can be stored. This can reduce the skew problem. Very skewed column values are not stored in the regular intervals but the number of rows for each value in the intervals for biased values. A histogram can therefore consist only of biased value intervals, only of equal-height intervals, or a combination of both.
Teradata uses only the number of intervals that are needed. If there are only ten different values in a column, only ten intervals will be created for the corresponding statistics; of course, we can reduce the number of intervals at any time, but this rarely makes sense.
The collection of statistics can be done for the whole table or a part of the table (to save resources). If only a part of the rows is used, this is called sample statistics.
Collected Statistics versus other Methods
In other articles, we discuss Teradata’s dynamic AMP sampling and sample statistics in detail. We refer to add the end of this article. Should you miss the basics, please consider reading these articles first.
All data collection considerations are based solely on one question: Are they sufficient to give the Optimizer a sufficiently good picture of the data demography? As we can imagine, this is closely related to the skew factor of a table. Here are a few rules to help us make the right decision:
1. Small tables always need collected statistics.
Imagine the following scenario: Our Teradata system consists of 300 AMPs, and our table contains five rows. A dynamic AMP sample is likely taken from an AMP that does not include rows. This will cause the Optimizer to assume that the table is empty. This should be avoided; a product join could be the result. In any case, the execution plan will not be optimal. Since collecting statistics on small tables is cheap, you should not waste a second thinking about not doing this.
2. On a Unique Index (Primary Index, Unique Secondary Index), one can usually rely on dynamic AMP sampling since an even distribution of the data is given by definition. Alternatively, one could collect sample statistics. This is to be considered for each case individually. The advantage of sample statistics could be that the Optimizer can create a more aggressive (and hopefully better performing) execution plan. Complete statistics should be avoided if the table is significant to avoid wasting resources.
Exception: if we are using non-equality constraints, we should always collect full statistics on them.
3. For indices that are not unique (NUPI, NUSI), the skew factor of the (index) table is again decisive if full statistics are needed. For NUSI, the Optimizer assumes that there is one subtable index row per value. If this is not the case, dynamic AMP sampling on the NUSI will give the wrong results.
4. On non-indexed columns used in query conditions (where conditions), collecting either full or sample statistics (again, depending on whether sample statistics sufficiently represent the data demography) is always advisable. This prevents the Optimizer from using heuristics.
Rules for Row Partitioned Tables
5. Collecting complete statistics on the dummy PARTITION column is an efficient way to give the optimizer information about empty partitions and the number of rows per partition. This information is used for the join costing of various join types (e.g., Sliding Window Merge Join).
6. We should always collect full or sample statistics on each partitioning column.
7. If the partition columns are not part of the Primary Index, we should additionally collect the following statistics:
PARTITION & Primary Index tells the Optimizer how many partitions a particular primary index value is available (the Optimizer uses this information in the case of dynamic partition elimination and sliding window merge joins).
PARTITION & Primary Index & each partitioning column: The Optimizer uses these statistics to estimate costs for row key-based joins.
Rules for multi-column statistics
8. As the Optimizer assumes independence for statistics collected on single columns, we should collect statistics on combinations of columns if there is a high correlation between them.