Collect Statistics in Teradata
Collect Stats in Teradata
As experienced Teradata tuning specialists, we are convinced that the two most important factors to obtain good system performance are the choice of a good primary index and the timely and appropriate collection of statistics.
Teradata uses a cost based optimizer. If we collect stats in Teradata, we provide the Optimizer with the data demographics it needs to generate good execution plans.
All cost based optimizers require statistical information about the data and hardware, such as CPU, Disk, Memory, and Processors.
The optimizer will create several plans to choose from and use the one with the lowest usage of resources. It will not evaluate each possible plan (which could be billions) but uses algorithms to identify the best candidates and chooses the best out of them.
The costs for each plan are derived from table row count of the physical tables and spools generated by operations such as selections, joins, and projections.
Accurate estimations are crucial to getting optimal plans. Providing statistical information for performance optimization is critical to optimal query plans.
Teradata distinguishes between statistics about tables and columns. Table statistics are for example the number of rows, the average row size and the number of rows per data block.
The column statistics are going more into detail, and store information about each column value (the column histograms) or the whole column (summary column statistics).
Summary column statistics are, for example, the number of distinct values, the lowest and the highest value.
Detailed column statistics are for example the number of rows per value. And information about the most common column values – which is used by the optimizer for skew detection.
The steps of interest for a Teradata Tuning Specialist are selections, joins, and aggregations.
We can think about retrieve steps as being the equivalent to the WHERE condition predicates of our query. “Aggregation steps” are generated by GROUP BY statements.
Statistics impact in which order Teradata executed joins and they determine which join strategy is used.
When we talk about join plan, we have to consider join types and data relocation steps:
Each time we join rows from two tables together, they have to be located on the same AMP. In the best case, both tables have the same primary index, and no data relocation is needed. In the worst case, both tables have to be rehashed by the join columns.
The Optimizer uses statistics to answer the following questions:
Is it cheaper to use an index or a full table scan? The Optimizer will always use the primary index or a Unique Secondary Index. There are no statistics considered, as it’s any way the cheapest way to access table rows.
But statistics help the optimizer to decide whether or not to do a full table scan or use a nonunique secondary index, a join index, or a hash index.
Join cost estimation for partitioned tables relies on statistic estimations on the partition columns and the so-called “dummy” partition column. We will explain this in detail later when we talk about best practices for statistic collection.
The optimizer has to find out which join type should be utilized (such as merge join, product join, hash join) and what’s the cheapest way to bring the rows from two tables to the same AMP.
Furthermore, it has to find out the optimal join order. The mentioned three variables wich have to be optimized at the same time give plenty of room to create a bad execution plan.
Only current statistics ensure that the optimizer can create the best execution plan possible.
As a join preparation step, rows can be duplicated to all AMPs or redistributed by the row hash of the join columns. Some join types require the tables to be sorted by row hash. Each of these preparation steps causes costs which have to weighted against each other.
Statistics allow the optimizer to take a higher risk when creating the execution plan and give it greater confidence in the estimation it uses.
Here is one example:
The hash join performs great if the smaller table of a join can be kept entirely in memory, but has bad performance if the memory is not sufficient. Without statistics, the optimizer may avoid the hash join. Information about table cardinalities and the available main memory are used to optimize the execution plan.
To summarize: Without statistics, query execution times would be unpredictable. They are the heart of the system!
See also: Teradata Statistics Basics