The Big Misunderstanding With Teradata Statistics
I am regulary drawn into discussions about the advantage and disadvantage of collecting statistics. Over and over again, I face the claim, that statistics collection may make things worse and has to be done very carefully.
In my experience, most times this claims are based upon a wrong assumption and i will argue against this assumption later in this article.
I absolutely agree, that the statistics to be collected should be choosen wisely and based on best practice recommendations (as a few examples: primary index, partition columns, join conditions, where conditions) and necessity.
For example: Why should you put statistics on a column which has almost unique values with a perfect distribution? The random AMP samples probably will be good enough in such a case.
The collection of statistics is a resource consuming task, this fact should be influencing our decisions when designing the statistics framework. Otherwise, huge amounts of resources may be wasted.
As you probably have been reading out already, I am definitely not a big fan of putting statistics on each and every column (I have seen this being done a lot by so-called “Teradata Experts”)
As another example, multi-column statistics may not add any value to the optimizer's decision, but the collection process usually needs more resources than statistics collected on single columns.
Nevertheless, there's a stubborn rumour going round: Changes on the existing statistics landscape are dangerous and may degrade performance. Often this sounds like we would face a badly designed optimizer, which is definitely not the case.
While a performance decrease may come true after adding statistics, it is very important not to misjudge cause and effect.
There is absolutely nothing mystical about how statistics are used by the Teradata optimizer. I will not go into detail as you can read our other postings about how statistics work. Statistics inform the optimizer about the number of rows per table and many other aspects of data demographics.
However, what most people get wrong here, it is not the addition of a certain statistics which may cause performance decrease, but a simple side effect:
By adding statistics, the optimizer may chose a different execution plan for a certain query, as it will include the additional statistics into its evaluation process. Unfortunately, this does not necessarily mean that the query will perform better. The opposite can be the case.
The new execution plan may lead to a different logical sequence of 2-table joins or may change the join type and the new plan would require further statistics. Without having this statistics available (or just outdated ones), the new execution plan can lead to a worse performance than the original one.
This is the point when I have seen many rumours arising about the danger of statistic and actually this is one of the topics which really drives me crazy as I have to discuss it over and over again:
“We can't change anything on the existing statistics. We tried it in the past but had bad results…”
To everybody who thinks like this I have to say:
- Complete your statistics if you run in such problems. There's probably something missing.
- Check for stale statistics.
The root cause of the problem is most likely on your side. Don't toy with the idea of an optimizer bug or similar excuses (“Teradata is behaving strange after I added statistics to this column”)
I would really like to hear your oppinion about this topic. Feel free to comment.