The collection of statistics is one of the most important tasks on a Teradata system, as proper statistics ensure consistent and excellent performance.
This post will give you all the details needed to create an optimized statistics landscape. While the information provided is related to Teradata Release 14.10, most of the ideas can be applied in the same way to any previous release of Teradata.
Teradata Statistics – Full or Sample?
The decision in which scenarios complete statistics should be used, and in which cases sample statistics are sufficient, is mainly driven by one consideration: How likely is it, that the optimizer’s estimation about data demographics is wrong? Expect the risk of a wrong estimation to be high for skewed tables.
1. Tiny tables always require fully collected statistics.
Imagine the following scenario: You system consists of 300 AMPs, but your small table contains only five rows. As you may recognize immediately, the risk is quite high that the random AMP sample is taken from an AMP which is not containing any row at all.
As a logical consequence, the optimizer will assume that the table is empty and will create a bad execution plan. Since the collection of statistics on small tables only consumes little resources, never spend a single thought about not collecting full statistics!
2. In the case of any unique index (UPI, USI) it is sufficient to collect sample statistics. You might even decide to omit statistics for huge tables at all (remember: gathering statistics may require plenty of resources). Runtime or random AMP sampling gives the optimizer an undistorted picture about data demographics. Nevertheless, the execution plan might be more conservative in the case of random AMP sampling.
One exception: if you are using non-equality constraints, you should always collect full statistics on the constraint column(s).
3. In the case of any not unique index (NUPI, NUSI) we are back again to assess the skewness of the table. If the table is evenly distributed across all AMPs, sample statistics may be sufficient.
4. You should always go for full statistics on any columns used in predicates.
5. Collecting complete statistics on the dummy PARTITION column is an efficient way to give the optimizer information about empty partitions and the number of rows per partition. You should always collect them.
6. You should always collect statistics on each partitioning column.
7. If the partition columns are not part of the Primary Index, you should additionally collect the following statistics:
PARTITION & Primary Index, as this tells the optimizer in how many partitions a particular primary index value is available. The optimizer uses this information in the case of dynamic partition elimination and sliding window merge joins.
PARTITION & Primary Index & each partitioning column: The optimizer uses this statistics to estimate costs for row key based joins.
8. As the optimizer assumes independence for statistics collected on single columns, it makes sense to gather statistics on columns together if there is a high correlation between them. The higher the correlation, the bigger is the advantage of collecting multi-column statistics.
Final words: In opposition to what its title might lead one to conclude, the list only contains eight topics. People love to read “Top 10 Lists” and this helps us to rank high in Google 😉