What are Teradata Statistics?
Statistics provide the Optimizer with the data demographics it needs to generate good execution plans. It will create several plans to choose from and use the one with the lowest cost of resources.
All cost based optimizers use statistical information about data and hardware.
Teradata uses statistical information about tables (summary statistics) and columns. Table metrics are for example the number of rows, the average row size, and the number of rows per data block.
Statistics about columns are more detailed than summary statistics. The number of rows per value, the number of distinct values, and the lowest and highest value. Information about the most common column values used by the optimizer for join planning.
An execution plan consists of join-, retrieve- or aggregation steps. Retrieve steps are caused by WHERE conditions. Aggregation steps by GROUP BY statements.
Statistics impact in which order the Optimizer executes the plan steps and which technique it uses.
The Optimizer uses statistics to answer the following questions:
Above you saw a few examples how statistics influence the execution plan.
Statistics allow the Optimizer to create more risky execution plans by increasing the confidence in its estimations.
As an example, let’s consider the hash join which performs great if the smaller table can be kept entirely in memory, but has bad performance if the memory is not enough.
Without collected statistics, the hash join might be avoided to mitigate the risk of bad performance.
Collected statistics are only a starting point for estimations. They are adjusted throughout the optimization process when more information from various database constraints such as CHECK constraints, referential integrity constraints, and query predicates are available. Adjusted estimations are called the “derived statistics”.
The Optimizer further refines the derived statistics each time it gets new details from:
– A join step
– Single table predicates (a WHERE condition)
– An Aggregation step
Here is one example:
Assume we have a WHERE condition like this one: “WHERE column IN (1,2,3). The Optimizer derives that the maximum number of distinct values of the result set will be 3 and uses this enhanced piece of information from now on to improve the estimations of the next execution steps.
Without collected statistics, the optimizer solely relies on random-AMP sampling when the Optimizer builds the execution plan. This approach gives good estimations if the data is evenly distributed (unique or nearly unique). Random-AMP sampling results in wrong estimates for skewed data.