This article details the significance of the gathered Teradata Statistics and the considerations to keep in mind while utilizing them. Additionally, we present practical guidance on the appropriate application of collected statistics and the other options that are accessible.
Teradata’s cost-based optimizer utilizes various metrics to evaluate the expense of a query, generate alternate execution plans, and ultimately employ the most efficient plan. Among these metrics, statistics are crucial as they provide valuable insights into the demographic characteristics of tables and indexes.
How the Optimizer Uses Statistics
Execution plans consist of sequential or parallel steps. When retrieving data, statistics help the Optimizer choose the best access path, such as an index or a full table scan. Primary index access, however, is so efficient that it is preferred over other access paths even when no statistics are available.
Statistics aid the Optimizer in determining the optimal join strategy: the join order, the join type (for instance, merge join or product join), and, if required, how the data is distributed across the AMPs.
A join preparation step can duplicate rows to all AMPs or redistribute them by the row hash of the join columns. Some types of joins require the tables to be sorted by row hash. Each of these preparation steps causes costs that must be weighed against each other. Statistics support the Optimizer in this decision.
Statistics enable the Optimizer to generate execution plans that are more efficient but that would be too risky to choose without reliable estimates.
The hash join is most efficient when the smaller table can fit entirely in memory, but its performance is suboptimal when this is not possible. Without statistics, the Optimizer may opt out of using the hash join to minimize the possibility of a poor plan.
What Are Collected Statistics?
Teradata statistics provide crucial information about the data, including the table row count, rows per column value, number of distinct column values, average row length, NULL values per column, and the column skew factor. The Optimizer uses this information to generate an effective execution plan, deciding on index usage as well as join strategy and join order. Statistics can be collected on single columns, on column combinations, and for entire tables, with the Optimizer deriving table-level key figures from table statistics rather than from column statistics.
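As an illustration, column statistics are collected and reviewed with statements like the following (the database, table, and column names are hypothetical):

```sql
-- Collect full statistics on a single column
COLLECT STATISTICS COLUMN (customer_id) ON sales_db.orders;

-- List the statistics that exist for the table and their summary figures
HELP STATISTICS sales_db.orders;
```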
The definition and collection of required statistics are not automatic but must be defined by the user. Still, Teradata now offers a framework (called “Autostats feature”) to automatically create and schedule jobs that, on the one hand, determine which statistics the Optimizer considers beneficial and, on the other hand, automate the collection of statistics. This framework is based on Viewpoint (as a scheduler), various DBC tables (for analysis), and procedures and can significantly simplify the creation and maintenance of statistics.
Teradata stores column statistics in histograms and table-wide figures in summary statistics. A column statistic can cover one or more columns. Each histogram is divided into intervals, and the goal is to keep roughly the same number of values in each interval. Rather than actual values, the intervals store approximations.
By default, the histogram is divided into 250 intervals, each containing information about a compressed range of values. The stored information includes the interval’s maximum and most frequent values, the number of values in the interval (excluding the most frequent), and the number of rows containing those values. With 250 intervals, each interval contains approximately 0.4% of the table’s rows.
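The interval detail itself can be inspected; a sketch using the same hypothetical names as above:

```sql
-- Display the histogram detail: intervals, their maximum values,
-- most frequent values, and row counts
SHOW STATISTICS VALUES COLUMN (customer_id) ON sales_db.orders;
```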
The number of intervals may vary. For columns with skewed data, it can be advantageous to dedicate intervals to the most frequent values, thereby reducing the skew within the remaining intervals. These so-called high-biased intervals do not store a range of values; instead, they record each frequent value together with the number of rows containing it. A histogram can therefore consist entirely of high-biased intervals, entirely of equal-height intervals, or a mixture of both.
Teradata creates intervals as needed. For instance, if a column has only ten distinct values, only ten intervals will be generated for the corresponding statistics. Although decreasing the number of intervals is possible, this is seldom practical.
Statistics can be gathered for the entire table or a portion of it to conserve resources. When only a subset of the rows is utilized, it is called sample statistics.
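Sample statistics are requested with the USING SAMPLE option; a sketch with hypothetical names (the explicit percentage form is available on newer Teradata releases):

```sql
-- Read only a subset of the rows instead of scanning the whole table
COLLECT STATISTICS USING SAMPLE COLUMN (order_date) ON sales_db.orders;

-- On releases that support it, the sample size can be pinned explicitly
COLLECT STATISTICS USING SAMPLE 10 PERCENT COLUMN (order_date) ON sales_db.orders;
```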
Collected Statistics versus Other Methods
For a detailed discussion on Teradata’s dynamic AMP sampling and sample statistics, please refer to our previous articles located at the end of this post. If you are unfamiliar with these topics, we recommend reading those articles before continuing.
Whether statistics need to be collected depends on how well the Optimizer can otherwise understand the data demographics, which closely correlates with the skew factor of a table. To decide, adhere to the following guidelines:
1. Small tables always need collected statistics.
In a scenario where our Teradata system is composed of 300 AMPs and our table has only five rows, a dynamic AMP sample may be taken from an AMP that lacks any rows, leading the Optimizer to assume that the table is empty. This circumstance must be avoided as it may result in a product join and a suboptimal execution plan. Given the low cost of collecting statistics for small tables, there is no reason to hesitate in doing so.
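Because the cost is negligible, small reference tables can simply always carry full statistics; a sketch with hypothetical names:

```sql
-- A tiny lookup table: full statistics are cheap to collect and refresh,
-- and they prevent the "empty table" misestimate from dynamic AMP sampling
COLLECT STATISTICS COLUMN (country_code) ON ref_db.country_codes;
```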
2. On a unique index (Unique Primary Index, Unique Secondary Index), one can usually rely on dynamic AMP sampling, since even data distribution is guaranteed by definition. Alternatively, one can collect sample statistics; this should be considered case by case. Their advantage is that the Optimizer can create a more aggressive (and, one hopes, better-performing) execution plan. Complete statistics should be avoided on large tables, as they waste resources.
If non-equality constraints are utilized, it is imperative to gather complete statistics on them.
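The two cases from this rule side by side, with hypothetical names:

```sql
-- Sample statistics on the unique primary index: uniqueness guarantees
-- even demographics, so a sample is representative
COLLECT STATISTICS USING SAMPLE COLUMN (order_id) ON sales_db.orders;

-- Full statistics on a column referenced with non-equality predicates
-- (e.g., BETWEEN or >= in WHERE conditions)
COLLECT STATISTICS COLUMN (order_date) ON sales_db.orders;
```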
3. For non-unique indexes (NUPI, NUSI), the skew factor of the (index) table again decides whether full statistics are needed. For a NUSI, the Optimizer assumes one index subtable row per value; if this is not the case, dynamic AMP sampling on the NUSI will give wrong results.
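A sketch for the NUSI case, assuming a hypothetical status column with few distinct, unevenly distributed values:

```sql
-- Full statistics on a skewed NUSI column; dynamic AMP sampling would
-- assume one index row per value and badly misestimate selectivity
COLLECT STATISTICS COLUMN (order_status) ON sales_db.orders;
```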
4. On non-indexed columns used in query conditions (where conditions), collecting either full or sample statistics (again, depending on whether sample statistics sufficiently represent the data demography) is always advisable. This prevents the Optimizer from using heuristics.
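For a non-indexed predicate column, the statement is the same; only the motivation differs (hypothetical names again):

```sql
-- Statistics on a non-indexed WHERE-condition column keep the Optimizer
-- from falling back to heuristic selectivity estimates
COLLECT STATISTICS COLUMN (shipping_region) ON sales_db.orders;
```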
Rules for Row Partitioned Tables
5. Collecting complete statistics on the dummy PARTITION column efficiently gives the optimizer information about empty partitions and the number of rows per partition. This information is used for the join costing of various join types (e.g., Sliding Window Merge Join).
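The dummy PARTITION column is referenced by name in the COLLECT statement; a sketch for a hypothetical row-partitioned table:

```sql
-- Full statistics on the system-derived PARTITION column: cheap to
-- collect, and they reveal empty partitions and rows per partition
COLLECT STATISTICS COLUMN (PARTITION) ON sales_db.sales_ppi;
```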
6. We should always collect full or sample statistics on each partitioning column.
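For example, if the hypothetical table sales_ppi is range-partitioned on a date column:

```sql
-- Statistics on the partitioning column itself (here: the date the
-- table is RANGE_N-partitioned on)
COLLECT STATISTICS COLUMN (sales_date) ON sales_db.sales_ppi;
```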
7. If the partition columns are not part of the Primary Index, we should additionally collect the following statistics:
Statistics on PARTITION combined with the Primary Index columns give the Optimizer the number of non-empty partitions per primary index value. The Optimizer uses this information for dynamic partition elimination and sliding-window merge joins.
Statistics on PARTITION, the Primary Index columns, and each partitioning column help the Optimizer estimate the costs of row-key-based joins.
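Both multi-column statistics from this rule, sketched for the same hypothetical table (primary index store_id, partitioning column sales_date):

```sql
-- PARTITION combined with the primary index column(s): supports dynamic
-- partition elimination and sliding-window merge joins
COLLECT STATISTICS COLUMN (PARTITION, store_id) ON sales_db.sales_ppi;

-- PARTITION, primary index, and partitioning column: supports the
-- costing of row-key-based joins
COLLECT STATISTICS COLUMN (PARTITION, store_id, sales_date) ON sales_db.sales_ppi;
```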
Rules for Multi-Column Statistics
8. As the Optimizer assumes independence for statistics collected on single columns, we should collect statistics on combinations of columns if there is a high correlation between them.
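A sketch for two strongly correlated columns (hypothetical names; the AS clause simply names the statistic for later maintenance):

```sql
-- Combined statistics on correlated columns, so the Optimizer does not
-- multiply the individual selectivities as if they were independent
COLLECT STATISTICS COLUMN (city, zip_code) AS stats_city_zip
ON sales_db.customers;
```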