What are Teradata Statistics?
Statistics provide the Optimizer with the data demographics it needs to generate good execution plans. It will create several plans to choose from and use the one with the lowest cost of resources.
All cost based optimizers use statistical information about data and hardware.
Teradata uses statistical information about tables (summary statistics) and columns. Table metrics are for example the number of rows, the average row size, and the number of rows per data block.
Statistics about columns are more detailed than summary statistics. The number of rows per value, the number of distinct values, and the lowest and highest value. Information about the most common column values used by the optimizer for join planning.[su_panel]Statistics are used to improve the execution plans.[/su_panel]
An execution plan consists of join-, retrieve- or aggregation steps. Retrieve steps are caused by WHERE conditions. Aggregation steps by GROUP BY statements.
Statistics impact in which order the Optimizer executes the plan steps and which technique it uses.
The Optimizer uses statistics to answer the following questions:[su_panel]Retrieve Steps:
Is it cheaper to use an index or a full table scan? The Optimizer will quickly choose and use any primary index, and it will also easily pick a Unique Secondary Index, but statistics help the optimizer to decide whether to do a full table scan or use a nonunique secondary index and if nonunique secondary index bit mapping should be utilized.[/su_panel] [su_panel]Join Steps:
Which join type should be used (such as merge join, product join, hash join, etc.). When two rows are joined, they must be located on the same AMP (this means that the join columns have to be the primary index of both tables). The Optimizer has to find out what’s the cheapest way to bring the rows together: For example, rows can be duplicated to all AMPs or redistributed by the row hash of the join columns. Last but not least, the optimizer has to decide about the ideal join order.[/su_panel]
Above you saw a few examples how statistics influence the execution plan.
Statistics allow the Optimizer to create more risky execution plans by increasing the confidence in its estimations.
As an example, let’s consider the hash join which performs great if the smaller table can be kept entirely in memory, but has bad performance if the memory is not enough.
Without collected statistics, the hash join might be avoided to mitigate the risk of bad performance.
Collected statistics are only a starting point for estimations. They are adjusted throughout the optimization process when more information from various database constraints such as CHECK constraints, referential integrity constraints, and query predicates are available. Adjusted estimations are called the “derived statistics”.
The Optimizer further refines the derived statistics each time it gets new details from:
– A join step
– Single table predicates (a WHERE condition)
– An Aggregation step
Here is one example:
Assume we have a WHERE condition like this one: “WHERE column IN (1,2,3). The Optimizer derives that the maximum number of distinct values of the result set will be 3 and uses this enhanced piece of information from now on to improve the estimations of the next execution steps.
Without collected statistics, the optimizer solely relies on random-AMP sampling when the Optimizer builds the execution plan. This approach gives good estimations if the data is evenly distributed (unique or nearly unique). Random-AMP sampling results in wrong estimates for skewed data.