Collecting statistics is one of the most important tools available to us for performance tuning. However, we have to think about many things because collecting statistics is always connected to resource consumption. The main considerations we need to make are:
- On which columns should statistics be collected?
- What kind of statistics should be collected (full statistics, sample statistics or is random AMP sampling sufficient)?
- How often should statistics be collected?
The collection of statistics should always be done first; only then should further tuning be considered. By collecting statistics, we give the optimizer the ability to create the best possible execution plan. Based on this, further steps should then be taken if necessary. Often the collection of statistics is the only tuning action needed. There is a command in Teradata which helps to determine which columns statistics the Optimized considers as helpful:
What Is DIAGNOSTIC HELPSTATS ON FOR SESSION?
Using the command “DIAGNOSTIC HELPSTATS ON FOR SESSION” we enable an option that displays the statistics required by the optimizer at the end of each execution plan (when the EXPLAIN command is used). Below is an example:
DIAGNOSTIC HELPSTATS ON FOR SESSION; EXPLAIN SELECT * FROM Customer WHERE CustomerId IN ( SELECT CustomerId FROM Sales ); BEGIN RECOMMENDED STATS FOR THIS Relation-> -- "COLLECT STATISTICS COLUMN (CustomerId) ON DWHPRO.CUstomer" (High Confidence) -- "COLLECT STATISTICS COLUMN ( CustomerId ) ON DWHPRO. Sales" (High Confidence) <-END RECOMMENDED STATS
How we test the benefits of the proposed statics is left up to us. We could add one statistic after the other and check our query immediately for changes, or we could add all the suggested statistics at once and then remove them one by one, leaving only the ones which caused a better execution plan.
My Personal Approach
Personally, I prefer to add all statistics with “high confidence” and concentrating mainly on single-column statistics. Next, I test the suggested multi-column statistics with “high confidence” and only at the very end, I consider the suggestions for statistics with “medium confidence”.
Traps When Using DIAGNOSTIC HELPSTATS ON FOR SESSION
Not all of the proposed statistics will improve the execution plan. Many of them can be redundant so that they do not change the execution plan but only cause additional costs for collecting them.
A disadvantage of DIAGNOSTIC HELPSTATS ON FOR SESSION is that it only tells us which statistics to collect, but not how to (full or sample); further analysis is needed to get all the details.
What is the Teradata Command HELPSTATS ON FOR SESSION?
Should statistics always be collected for all proposals?
No, because collecting statistics consumes resources. We only collect statistics if they improve our queries.
Does this command give us information about which statistics are more important than others?
Yes, by including the confidence level: “high confidence”, “medium confidence”, “low confidence”.
How do we know what kind of statistics to create?
This information is not provided. So it is up to us to determine if a column is skewed and then choose the appropriate type of statistic.
Does DIAGNOSTIC HELPSTATS ON FOR SESSION provide all required statistics for a table?
No, the proposed statistics only apply to the SQL query in question