Introduction
As a seasoned Teradata professional, I have encountered several misconceptions about how the system’s optimizer utilizes statistics and their repercussions on the overall system performance. These misunderstandings may result in suboptimal decisions while collecting and managing statistics, thus impacting the Teradata system’s efficiency and efficacy. This comprehensive guide aims to dispel these misconceptions, provide insights on best practices for statistics collection, and examine the actual reasons for performance degradation.
Misconceptions about Teradata Statistics
Collecting statistics can potentially harm system performance, but some experts advise caution. Despite this caution, statistics actually aid in the optimization process, resulting in enhanced performance in most cases.
Including statistics can be risky: It is a fallacy that appending new statistics may cause a decline in performance, ultimately altering the decision-making process of the optimizer. Although statistics can influence the optimizer’s choices, they are not inherently hazardous, provided that they are pertinent and current.
The optimizer’s design is often misunderstood. Some professionals may mistakenly believe it is inadequate due to performance issues resulting from adding or modifying statistics. However, the optimizer is a highly intricate part of the Teradata system, engineered to make optimal decisions with available information. In reality, the problem stems from providing precise and pertinent statistics, which we will cover in detail later in this guide.
Best Practices in Statistics Collection
Careful consideration of statistics selection is crucial. Best practice recommendations and necessities should guide the selection process. Key columns to gather statistics include a primary index, partition columns, join conditions, and where conditions.
Do not gather statistics in every column. Despite the advice of supposed Teradata specialists, it is not advisable. Rather, concentrate on gathering statistics for columns that significantly affect the optimizer’s decision-making process.
Exercise caution when dealing with multi-column statistics. Although these statistics can be useful to the optimizer, they may also require more resources than single-column statistics. Therefore, limit the use of multi-column statistics only when necessary and beneficial to the optimizer’s decisions.
It is important to keep statistics up-to-date as stale or outdated statistics can lead to suboptimal decisions by the optimizer. Updating statistics regularly is crucial to ensure they accurately reflect the current state of your data.
Continuously monitor and review your statistics strategy for optimal Teradata system performance. Adjust as necessary.
The Real Cause of Performance Degradation
Adding or modifying statistics could potentially cause a decline in performance, but comprehending the cause and effect is crucial. The optimizer relies on statistics to make informed decisions and select the most optimal execution plan for a given query. Nonetheless, introducing new statistics may prompt the optimizer to opt for an alternative execution plan that may not necessarily result in better performance.
The problem occurs when the updated execution plan alters the logical order of 2-table joins or includes a different join type, necessitating updated statistics. Outdated or inadequate statistics may cause the new plan to underperform compared to the initial plan.
Misconceptions often arise surrounding the fear of altering existing statistics, leading professionals to avoid changing their statistics landscape for fear of worsening the optimizer’s decisions. It is important to remember that encountering such problems necessitates taking action.
- Complete your statistics if necessary, as there’s probably something missing.
- Check for stale statistics and update them as needed. Stale statistics can mislead the optimizer, resulting in suboptimal execution plans and poor performance.
- Investigate potential issues within your data model or system configuration. Don’t assume that the optimizer is faulty or that Teradata is “behaving strangely” after adding statistics to a column. The root cause of the problem is likely on your side, and it’s your responsibility to address it.
- Educate yourself and your team on how the Teradata optimizer works and the importance of statistics. Understanding the underlying concepts and mechanisms can help you make better decisions when collecting and managing statistics.
- Collaborate with other Teradata professionals and share your experiences. Learning from others can provide valuable insights and help you develop more effective strategies for managing your Teradata system’s performance.
- Be proactive in addressing performance issues. Don’t wait for problems to arise before taking action. Regularly review and optimize your statistics landscape to ensure the best possible performance.
- Consider using advanced tools and techniques to manage your statistics, such as Teradata’s automated statistics management features. These tools can help you maintain up-to-date and relevant statistics with minimal manual intervention, allowing you to focus on other critical aspects of your system.
- Continuously improve your understanding of Teradata’s capabilities and limitations. As with any technology, staying informed about the latest developments and best practices is essential to optimizing your system’s performance.
Conclusion
Teradata statistics are vital for the optimizer’s decision-making process and significantly affect your system’s overall performance. Misconceptions regarding statistics utilization and management can cause suboptimal decisions, reducing your system’s efficiency. You can guarantee optimal Teradata system performance by comprehending the Teradata optimizer’s functioning, adhering to statistics collection best practices, and tackling the actual performance degradation causes.
Please share your insights, experiences, and queries regarding Teradata statistics and performance optimization in the comments. Through candid and constructive conversations, we can collaborate and formulate better approaches to manage our Teradata systems.
Hi Roland,
I always try collect stats at least on PIs of tables & on columns which are being frequntly used in joins/WHERE.
Is that fine as per the above disscussion .
Thanks,
Nagendra
Collecting full statistics is the best method in order to achieve the best possible execution plan.
We can’t always do it as the collecting of statistics is costly (Disk IOs, CPU time), but if
the costs are not a problem (for example, if you have a batch window for this task),
collecting statistics on all primary indexes and all WHERE conditions definitely is a good decision.
Having collected statistics for the WHERE condition is very important; If these statistics are missing, the Optimizer will use simple heuristics to determine the spool cardinality.
This is how the heuristic works:
The first WHERE conditions are always estimated to be 10% of the total table rows.
If, for example, our table contains 100,000 rows, the Optimizer assumes that 10,000 rows will be retrieved.
Be aware, the 10% rule is applied to the estimated table cardinality which is either determined by random-AMP sampling or available form collected primary index statistics. Skewed tables probably will lead to a heuristic, which is based on a wrong table cardinality estimation.
Therefore, statistics on WHERE conditions and primary index statistics work together!
Here are some examples:
— No statistic on the WHERE conditions available
SELECT * FROM The_Table WHERE col=10; — 10,000 rows are estimated, even if not a single row has a column value = 10 !
For each further ANDed column, 75% of the previous estimation are taken:
SELECT * FROM The_Table WHERE col=10 AND col2 = 20; — 10,000 * 75% = 7,500 rows are estimated
In case of ORed conditions, each column selected, accounts for 10%:
SELECT * FROM The_Table WHERE col=10 OR col2 = 20; — 10,000 + 10,000 = 20,000 rows are estimated
In case we have statistics on one or more of the WHERE condition columns, the one with the highest selectivity (which estimate the least rows to be returned) is applied:
COLLECT STATISTICS ON THE_TABLE COLUMN(col); — We assume the estimated cardinality is 5,000 rows
COLLECT STATISTICS ON THE_TABLE COLUMN(col2); — We assume the estimated cardinality is 10,000 rows
SELECT * FROM The_Table WHERE col=10 AND col2 = 20; — 5,000 * 75% = 3,750 rows are estimated
The statistics of column col (5,000 estimated rows) are used as a base for applying heuristics on col2.
The collected statistics of col2 are completely ignored!
The heuristics for columns, which are having collected statistics, are not always using the 75% rule.
If the value is skewed (being available more than 75% of the time), the real number is used.
For example, if the column col2 would contain the value = 20 in 90% of the rows, 90% would be used in the heuristic
calculation: 5,000 * 90% = 4,500
Another important fact is that as soon as any heuristics are included in the estimation,
the confidence level of the retrieving step is lowered to “no confidence”.
This will have an impact on the execution plan.
If you want to achieve high confidence, the Optimizer requires multi-column statistics:
COLLECT STATISTICS ON THE_TABLE COLUMN(col,col2);
Conclusion:
The statistics you mention absolutely make sense, but I would add as well statistic on the join column(s) of each join step, as this helps the Optimizer in join planning (order of join steps, join type, redistribution strategy, etc.)
I hope this helped
Roland
Hello Tom!
Thank you very much for the encouraging comment!
We are excited to receive such positive feedback from a master of the trade!
I suggest that we continue our discussion using my e-mail address:
[email protected]
Roland,
I really like the way you explain things. You hit important points without digressing into unimportant or confusing details. You write as I teach! I would like for you to consider co-authoring a book with me. You are that good!
Hi Roland,
I couldn’t agree more, there is so much misunderstanding in this area.
One of the most common things I hear is that “I collected statistics and it improved the confidence level. That is good isn’t it?”
The number of times I’ve had to explain to them that if the ** only ** change is confidence then the stats are a waste of time, they will not have improved performance. The following discussion is then about ‘structure of the explain plan’ with my finishing line being that “for statistics to be effective they MUST change the structure of the explain plan – otherwise they are a complete waste of resources”.
Cheers,
Dave
Confidence levels are a very good point, thanks, David.
Statistics are a topic which is close to my heart as in my opinion each performance optimization should start with a review of statistics (are statistics missing? are they stale?) but in my experience, this task is many times skipped for different reasons: Sometimes a lack of understanding, but sometimes the environment of the Data Warehouse does not allow to “experiment” with statistics.
I will give you an example: In one of my last projects, it was not possible to do proper performance optimization as only database administrators were allowed to add/remove statistics. Each required change needed at least one day. This actually forced me to skip statistics considerations completely and I had to rely on query rewriting only.
Another mistake I can see quite often is that people look only at the date of the last recollection and decide based on this date if statistics are stale.