Over the years of working with Teradata systems, I have encountered many misunderstandings about how the optimizer uses statistics and what impact they have on overall system performance.
I am regularly drawn into discussions about the advantages and disadvantages of collecting statistics. Over and over again, I face the claim that collecting statistics may make things worse and must be done very carefully.
In my experience, these claims are often based on a false assumption, and I will argue against this assumption later in this article.
I agree that the statistics to be collected should be chosen wisely, based on necessity and on best-practice recommendations (a few examples: the primary index, partition columns, join columns, and columns used in WHERE conditions).
For example, why should you collect statistics on a nearly unique column with an even distribution? Random-AMP samples will probably be good enough in such a case.
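To make the recommendation concrete, here is a minimal sketch of a targeted collection, assuming Teradata 14 or later syntax; the database, table, and column names are hypothetical:

```sql
-- Hypothetical example: collect statistics only on the columns the
-- optimizer benefits from most. Table and column names are made up.
COLLECT STATISTICS
    COLUMN (customer_id),   -- primary index column
    COLUMN (PARTITION),     -- partition column of a PPI table
    COLUMN (sales_date)     -- frequent WHERE/join condition column
ON sales_db.sales_transactions;
```

A nearly unique, evenly distributed column would deliberately be left off this list and covered by random-AMP sampling instead.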
The collection of statistics is a resource-consuming task. This fact should influence our decisions when designing the statistics framework. Otherwise, vast amounts of resources may be wasted.
As you have probably gathered by now, I am not a big fan of putting statistics on every column (I have seen this done a lot by so-called "Teradata Experts").
As another example, multi-column statistics may add no value to the optimizer's decisions, yet collecting them usually consumes more resources than single-column statistics.
Nevertheless, there is a stubborn rumor that changes to the existing statistics landscape are dangerous and may degrade performance. Often this sounds as if we were dealing with a poorly designed optimizer, which is not the case.
While performance may indeed decrease after adding statistics, it is important not to confuse cause and effect.
There is absolutely nothing mystical about how the Teradata optimizer uses statistics. I will not go into detail as you can read our other postings about how statistics work. Statistics inform the optimizer about the number of rows per table and many other aspects of data demographics.
However, what most people get wrong here is this: it is not the addition of a specific statistic itself that causes the performance decrease, but a simple side effect:
By adding statistics, the optimizer may choose a different execution plan for a specific query, as it will include the additional statistics into its evaluation process. Unfortunately, this does not necessarily mean that the query will perform better. The opposite can be the case.
The new execution plan may lead to a different logical sequence of two-table joins or may change the join type, and the new plan may require further statistics. Without these statistics available (or with only stale ones), the new execution plan can perform worse than the original one.
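One way to see this side effect for yourself is to compare the EXPLAIN output before and after collecting a new statistic. The query and object names below are a hypothetical sketch:

```sql
-- Hypothetical sketch: capture the execution plan, collect a new
-- statistic, then re-run EXPLAIN and compare join order and join types.
EXPLAIN
SELECT t1.customer_id, SUM(t2.amount)
FROM   sales_db.customers t1
JOIN   sales_db.sales_transactions t2
  ON   t1.customer_id = t2.customer_id
WHERE  t2.sales_date >= DATE '2020-01-01'
GROUP BY t1.customer_id;

-- After, for example:
-- COLLECT STATISTICS COLUMN (customer_id) ON sales_db.sales_transactions;
-- the same EXPLAIN may show a different join sequence or join type.
```

If the new plan is worse, the fix is usually to complete or refresh the statistics the new plan relies on, not to remove the statistic you just added.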
This is where I have seen many rumors about the danger of statistics arise, and frankly, it is one of the topics that drives me crazy, because I have to discuss it over and over again:
“We can’t change anything on the existing statistics. We tried it in the past but had bad results…”
To everybody who thinks like this, I have to say:
- Complete your statistics if you run into such problems. There’s probably something missing.
- Check for stale statistics.
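Checking for stale statistics can be scripted. The following is a minimal sketch, assuming Teradata 14.0 or later, where the dictionary view DBC.StatsV exposes collection timestamps; the 30-day threshold is an arbitrary choice:

```sql
-- Hypothetical sketch (Teradata 14.0+): list statistics that have not
-- been refreshed recently. The 30-day cutoff is an arbitrary choice.
SELECT DatabaseName,
       TableName,
       ColumnName,
       LastCollectTimeStamp
FROM   DBC.StatsV
WHERE  LastCollectTimeStamp < CURRENT_TIMESTAMP - INTERVAL '30' DAY
ORDER  BY LastCollectTimeStamp;
```

Any entry that shows up here after significant data growth or change is a candidate for a refresh before you blame the optimizer.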
The root cause of the problem is most likely on your side. Don't toy with the idea of an optimizer bug or similar excuses ("Teradata is behaving strangely after I added statistics to this column").
I would like to hear your opinion about this topic. Feel free to comment.