Over the years of working with Teradata systems, I have encountered many misunderstandings about how the optimizer uses statistics and how they affect overall system performance.
I am regularly drawn into discussions about the advantages and disadvantages of collecting statistics. Over and over again, I face the claim that statistics collection may make things worse and has to be done very carefully.
In my experience, these claims are often based on a false assumption, and I will argue against this assumption later in this article.
I agree that the statistics to be collected should be chosen wisely, based on best-practice recommendations (a few examples: primary index columns, partition columns, join columns, WHERE condition columns) and on actual need.
For example, why would you collect statistics on a nearly unique column with an even distribution? Random-AMP samples will probably be good enough in such a case.
The collection of statistics is a resource-consuming task. This fact should influence our decisions when designing the statistics framework. Otherwise, vast amounts of resources may be wasted.
As you have probably gathered by now, I am not a big fan of putting statistics on every column (I have seen this done a lot by so-called "Teradata experts").
As another example, multi-column statistics may not add any value to the optimizer's decisions, yet collecting them usually requires more resources than single-column statistics.
Nevertheless, there is a stubborn rumor that changes to the existing statistics landscape are dangerous and may degrade performance. Often this sounds as if we were dealing with a poorly designed optimizer, which is not the case.
While performance may indeed decrease after adding statistics, it is important not to confuse cause and effect.
There is absolutely nothing mystical about how the Teradata optimizer uses statistics. I will not go into detail here, as you can read our other posts about how statistics work. In short, statistics inform the optimizer about the number of rows per table and many other aspects of data demographics.
However, what most people get wrong is this: it is usually not the added statistic itself that causes a performance decrease, but a simple side effect:
By adding statistics, the optimizer may choose a different execution plan for a specific query, as it includes the additional statistics in its evaluation. Unfortunately, this does not necessarily mean that the query will perform better. The opposite can be the case.
The new execution plan may lead to a different sequence of two-table joins or to a different join type, and the new plan may require further statistics. If these statistics are missing (or stale), the new execution plan can perform worse than the original one.
This is where many of the rumors about the danger of statistics arise, and it is one of the topics that drives me crazy because I have to discuss it over and over again:
“We can’t change anything on the existing statistics. We tried it in the past but had bad results…”
To everybody who thinks like this, I have to say:
- Complete your statistics if you run into such problems. There’s probably something missing.
- Check for stale statistics.
The root cause of the problem is most likely on your side. Don't toy with the idea of an optimizer bug or similar excuses ("Teradata is behaving strangely after I added statistics to this column").
I would like to hear your opinion about this topic. Feel free to comment.
Hi Roland,
I always try to collect stats at least on the PIs of tables and on columns which are frequently used in joins/WHERE conditions.
Is that fine as per the above discussion?
Thanks,
Nagendra
Collecting full statistics is the best method to achieve the best possible execution plan. We can't always do it, as collecting statistics is costly (disk I/O, CPU time), but if the costs are not a problem (for example, if you have a batch window for this task), collecting statistics on all primary indexes and all WHERE condition columns is definitely a good decision.
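As a small illustration (the table and column names below are purely hypothetical), such a setup could look like this:
-- Hypothetical example: statistics on the primary index column and on columns
-- frequently used in WHERE conditions and joins of a fictional Sales table.
COLLECT STATISTICS ON Sales COLUMN(Sales_Id);    -- primary index column
COLLECT STATISTICS ON Sales COLUMN(Sales_Date);  -- frequent WHERE condition column
COLLECT STATISTICS ON Sales COLUMN(Customer_Id); -- frequent join column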
Having collected statistics on the WHERE condition columns is very important; if these statistics are missing, the Optimizer uses simple heuristics to estimate the spool cardinality.
This is how the heuristic works:
The first WHERE condition is always estimated to select 10% of the total table rows.
If, for example, our table contains 100,000 rows, the Optimizer assumes that 10,000 rows will be retrieved.
Be aware that the 10% rule is applied to the estimated table cardinality, which is either determined by random-AMP sampling or taken from collected primary index statistics. Skewed tables will therefore probably lead to a heuristic based on a wrong table cardinality estimate.
Therefore, statistics on WHERE conditions and primary index statistics work together!
Here are some examples:
-- No statistics on the WHERE condition columns available
SELECT * FROM The_Table WHERE col=10; -- 10,000 rows are estimated, even if not a single row has a column value = 10!
For each further ANDed column, 75% of the previous estimate is taken:
SELECT * FROM The_Table WHERE col=10 AND col2 = 20; -- 10,000 * 75% = 7,500 rows are estimated
In the case of ORed conditions, each selected column accounts for another 10%:
SELECT * FROM The_Table WHERE col=10 OR col2 = 20; -- 10,000 + 10,000 = 20,000 rows are estimated
If we have statistics on one or more of the WHERE condition columns, the one with the highest selectivity (i.e., the one estimating the fewest rows to be returned) is used:
COLLECT STATISTICS ON The_Table COLUMN(col); -- We assume the estimated cardinality is 5,000 rows
COLLECT STATISTICS ON The_Table COLUMN(col2); -- We assume the estimated cardinality is 10,000 rows
SELECT * FROM The_Table WHERE col=10 AND col2 = 20; -- 5,000 * 75% = 3,750 rows are estimated
The statistics on column col (5,000 estimated rows) are used as the base for applying the heuristic on col2.
The collected statistics on col2 are completely ignored!
The heuristics for columns that have collected statistics do not always use the 75% rule. If a value is skewed (occurring in more than 75% of the rows), the real percentage is used. For example, if column col2 contained the value 20 in 90% of the rows, 90% would be used in the heuristic calculation: 5,000 * 90% = 4,500 rows.
Another important fact is that as soon as any heuristics are included in the estimation, the confidence level of the retrieving step is lowered to "no confidence". This will have an impact on the execution plan.
If you want to achieve high confidence, the Optimizer requires multi-column statistics:
COLLECT STATISTICS ON The_Table COLUMN(col,col2);
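To verify the effect, you can check the retrieve step in the EXPLAIN output. The following is only a sketch; the exact plan wording varies by Teradata release:
-- With only single-column statistics, the retrieve step for this query is typically
-- estimated "with no confidence"; once the multi-column statistics above are in place,
-- the same step can be estimated "with high confidence".
EXPLAIN SELECT * FROM The_Table WHERE col = 10 AND col2 = 20;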
Conclusion:
The statistics you mention absolutely make sense, but I would also add statistics on the join column(s) of each join step, as this helps the Optimizer with join planning (order of join steps, join type, redistribution strategy, etc.).
I hope this helped
Roland
Hello Tom!
Thank you very much for the encouraging comment!
We are excited to receive such positive feedback from a master of the trade!
I suggest that we continue our discussion using my e-mail address:
[email protected]
Roland,
I really like the way you explain things. You hit important points without digressing into unimportant or confusing details. You write as I teach! I would like for you to consider co-authoring a book with me. You are that good!
Hi Roland,
I couldn’t agree more, there is so much misunderstanding in this area.
One of the most common things I hear is: "I collected statistics and it improved the confidence level. That is good, isn't it?"
I have lost count of the times I've had to explain that if the **only** change is the confidence level, then the stats are a waste of time; they will not have improved performance. The following discussion is then about the structure of the explain plan, with my finishing line being that "for statistics to be effective they MUST change the structure of the explain plan, otherwise they are a complete waste of resources".
Cheers,
Dave
Confidence levels are a very good point, thanks, David.
Statistics are a topic close to my heart, as in my opinion every performance optimization should start with a review of statistics (are statistics missing? are they stale?). In my experience, however, this task is often skipped for various reasons: sometimes a lack of understanding, sometimes a data warehouse environment that does not allow "experimenting" with statistics.
I will give you an example: in one of my last projects it was not possible to do proper performance optimization because only database administrators were allowed to add or remove statistics, and each required change took at least one day. This effectively forced me to skip statistics considerations completely and rely on query rewriting only.
Another mistake I see quite often is that people look only at the date of the last recollection and decide based on this date alone whether statistics are stale.
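A better check also compares the row count captured at the last collection with the current table cardinality. Here is a minimal sketch, assuming a Teradata 14.x or later release where the DBC.StatsV view exposes RowCount and LastCollectTimestamp; view and column names may differ on your release, and the database and table names are hypothetical:
-- Row count captured when the statistics were last collected:
SELECT ColumnName, LastCollectTimestamp, RowCount
FROM DBC.StatsV
WHERE DatabaseName = 'My_Database'
  AND TableName = 'The_Table';
-- Current table cardinality, for comparison:
SELECT COUNT(*) FROM My_Database.The_Table;
If the two row counts differ significantly (for example, by more than 10%), the statistics are most likely stale, even if the last collection date looks recent.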