The most crucial optimization task is to provide the Teradata Optimizer with all necessary statistics. Statistic collection should always be done first before doing any further optimization considerations.
On which columns which kind of statistics (Full Statistics, Random-AMP Sampling, or Sample Statistics) are required depends primarily on data demography.
Random-AMP sampling is usually sufficient for unskewed data, and if the table is large enough. The table should contain significantly more rows than there are AMPs in the system.
If the data are skewed, there is a high probability that the sample is from an AMP that is not representative of the data. Skewed data leads to an overestimation or underestimation of the number of rows.
Good candidates for Random-AMP sampling are unique indices, i.e., UPI and USI.
However, random AMP sampling only takes place on indexed columns. Therefore, we always must collect full statistics on columns listed in WHERE clauses.
Full statistics are also appropriate for skewed column values and columns with a small number of distinct values, such as NUPI and NUSI.
The Teradata Optimizer also has unique requirements regarding statistics when it comes to PPI tables.
Statistics should be collected here in any case:
Statistics on column “PARTITION” tell the optimizer how many partitions are empty.
PARTITION + PI column
These statistics are required if the partitioning columns are not part of the primary index. In this case, the same primary index value can exist in different partitions.
Statistics on column “PARTITION” + PI help the optimizer to estimate the cost of the sliding window and rowkey based merge join, and dynamic partition elimination.
We can make the following statement to determine what statistics the optimizer lacks:
DIAGNOSIS HELPSTATS ON FOR SESSION;
Above statement displays a list of missing statistics at the end of the Execution Plan (EXPLAIN statement) and the Optimizer’s estimation of their importance (Confidence Levels).
By gradually adding these statistics, you can test their influence on the improvement of the execution plan.
There are several ways to find old statistics. The easiest way is to split the SQL statement and test each partial statement individually. Splitting is done merely by comparing the estimated number of rows (as shown in the Explain output) with the actual number of records returned by the query.
Above described approach is particularly suitable if the entire SQL statement does not execute in a reasonable time.
Here’s an example:
SELECT t01.* FROM <Tablename1> t01
table name2> t02
ON t01.PK = t02.PK
WHERE t01.<Column name> = 1 AND t02.<Column name> = 2;
The above query can be divided into two parts for testing:
SELECT * FROM <Tablename1> WHERE <Columnnname> = 1;
SELECT * FROM <Tablename2> WHERE <Columnnname> = 2;
If you execute both sub-queries, and the number of rows returned differs significantly from the estimate in EXPLAIN, the statistics may be obsolete.