Over the years of working with Teradata systems, I have encountered many misunderstandings about how the optimizer uses statistics and how they affect overall system performance.
I am regularly drawn into discussions about the advantages and disadvantages of collecting statistics. Over and over again, I face the claim that statistics collection may make things worse and has to be done very carefully.
In my experience, these claims are often based on a false assumption, and I will argue against this assumption later in this article.
I agree that the statistics to be collected should be chosen wisely, based on best-practice recommendations (a few examples: primary index columns, partition columns, join columns, WHERE condition columns) and on actual need.
For example, why would you collect statistics on a nearly unique column with an even distribution? Random-AMP samples will probably be good enough in such a case.
The collection of statistics is a resource-consuming task. This fact should influence our decisions when designing the statistics framework. Otherwise, vast amounts of resources may be wasted.
As you have probably gathered by now, I am not a big fan of putting statistics on every column (I have seen this done a lot by so-called "Teradata experts").
As another example, multi-column statistics may not add any value to the optimizer's decisions, yet collecting them usually requires more resources than single-column statistics.
Nevertheless, there is a stubborn rumor that changes to the existing statistics landscape are dangerous and may degrade performance. Often this sounds as if we were dealing with a poorly designed optimizer, which is not the case.
While performance may indeed decrease after adding statistics, it is important not to confuse cause and effect.
There is absolutely nothing mystical about how the Teradata optimizer uses statistics. I will not go into detail here, as you can read our other posts about how statistics work. In short, statistics inform the optimizer about the number of rows per table and many other aspects of data demographics.
However, what most people get wrong is this: it is usually not the added statistic itself that causes a performance decrease, but a simple side effect:
By adding statistics, the optimizer may choose a different execution plan for a specific query, as it includes the additional statistics in its evaluation. Unfortunately, this does not necessarily mean that the query will perform better. The opposite can be the case.
The new execution plan may lead to a different sequence of two-table joins or to a different join type, and the new plan may require further statistics. If these statistics are missing (or stale), the new execution plan can perform worse than the original one.
This is where many of the rumors about the danger of statistics arise, and it is one of the topics that drives me crazy because I have to discuss it over and over again:
“We can’t change anything on the existing statistics. We tried it in the past but had bad results…”
To everybody who thinks like this, I have to say:
- Complete your statistics if you run into such problems. There’s probably something missing.
- Check for stale statistics.
The root cause of the problem is most likely on your side. Don't toy with the idea of an optimizer bug or similar excuses ("Teradata is behaving strangely after I added statistics to this column").
I would like to hear your opinion about this topic. Feel free to comment.
Hi Roland,
I always try to collect stats at least on the PIs of tables and on columns which are frequently used in joins/WHERE conditions.
Is that fine as per the above discussion?
Thanks,
Nagendra
Collecting full statistics is the best method to achieve the best possible execution plan. We can't always do it, as collecting statistics is costly (disk I/O, CPU time), but if the costs are not a problem (for example, if you have a batch window for this task), collecting statistics on all primary indexes and all WHERE condition columns is definitely a good decision.
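As a small illustration (the table and column names below are purely hypothetical), such a setup could look like this:
-- Hypothetical example: statistics on the primary index column and on columns
-- frequently used in WHERE conditions and joins of a fictional Sales table.
COLLECT STATISTICS ON Sales COLUMN(Sales_Id);    -- primary index column
COLLECT STATISTICS ON Sales COLUMN(Sales_Date);  -- frequent WHERE condition column
COLLECT STATISTICS ON Sales COLUMN(Customer_Id); -- frequent join column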
Having collected statistics on the WHERE condition columns is very important; if these statistics are missing, the Optimizer uses simple heuristics to estimate the spool cardinality.
This is how the heuristic works:
The first WHERE condition is always estimated to select 10% of the total table rows.
If, for example, our table contains 100,000 rows, the Optimizer assumes that 10,000 rows will be retrieved.
Be aware that the 10% rule is applied to the estimated table cardinality, which is either determined by random-AMP sampling or taken from collected primary index statistics. Skewed tables will therefore probably lead to a heuristic based on a wrong table cardinality estimate.
Therefore, statistics on WHERE conditions and primary index statistics work together!
Here are some examples:
-- No statistics on the WHERE condition columns available
SELECT * FROM The_Table WHERE col=10; -- 10,000 rows are estimated, even if not a single row has a column value = 10!
For each further ANDed column, 75% of the previous estimate is taken:
SELECT * FROM The_Table WHERE col=10 AND col2 = 20; -- 10,000 * 75% = 7,500 rows are estimated
In the case of ORed conditions, each selected column accounts for another 10%:
SELECT * FROM The_Table WHERE col=10 OR col2 = 20; -- 10,000 + 10,000 = 20,000 rows are estimated
If we have statistics on one or more of the WHERE condition columns, the one with the highest selectivity (i.e., the one estimating the fewest rows to be returned) is used:
COLLECT STATISTICS ON The_Table COLUMN(col); -- We assume the estimated cardinality is 5,000 rows
COLLECT STATISTICS ON The_Table COLUMN(col2); -- We assume the estimated cardinality is 10,000 rows
SELECT * FROM The_Table WHERE col=10 AND col2 = 20; -- 5,000 * 75% = 3,750 rows are estimated
The statistics on column col (5,000 estimated rows) are used as the base for applying the heuristic on col2.
The collected statistics on col2 are completely ignored!
The heuristics for columns that have collected statistics do not always use the 75% rule. If a value is skewed (occurring in more than 75% of the rows), the real percentage is used. For example, if column col2 contained the value 20 in 90% of the rows, 90% would be used in the heuristic calculation: 5,000 * 90% = 4,500 rows.
Another important fact is that as soon as any heuristics are included in the estimation, the confidence level of the retrieving step is lowered to "no confidence". This will have an impact on the execution plan.
If you want to achieve high confidence, the Optimizer requires multi-column statistics:
COLLECT STATISTICS ON The_Table COLUMN(col,col2);
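To verify the effect, you can check the retrieve step in the EXPLAIN output. The following is only a sketch; the exact plan wording varies by Teradata release:
-- With only single-column statistics, the retrieve step for this query is typically
-- estimated "with no confidence"; once the multi-column statistics above are in place,
-- the same step can be estimated "with high confidence".
EXPLAIN SELECT * FROM The_Table WHERE col = 10 AND col2 = 20;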
Conclusion:
The statistics you mention absolutely make sense, but I would also add statistics on the join column(s) of each join step, as this helps the Optimizer with join planning (order of join steps, join type, redistribution strategy, etc.).
I hope this helped
Roland
Hello Tom!
Thank you very much for the encouraging comment!
We are excited to receive such positive feedback from a master of the trade!
I suggest that we continue our discussion using my e-mail address:
[email protected]
Roland,
I really like the way you explain things. You hit important points without digressing into unimportant or confusing details. You write as I teach! I would like for you to consider co-authoring a book with me. You are that good!
Hi Roland,
I couldn’t agree more, there is so much misunderstanding in this area.
One of the most common things I hear is: "I collected statistics and it improved the confidence level. That is good, isn't it?"
I have lost count of the times I've had to explain that if the **only** change is the confidence level, then the stats are a waste of time; they will not have improved performance. The following discussion is then about the structure of the explain plan, with my finishing line being that "for statistics to be effective they MUST change the structure of the explain plan, otherwise they are a complete waste of resources".
Cheers,
Dave
Confidence levels are a very good point, thanks, David.
Statistics are a topic close to my heart, as in my opinion every performance optimization should start with a review of statistics (are statistics missing? are they stale?). In my experience, however, this task is often skipped for various reasons: sometimes a lack of understanding, sometimes a data warehouse environment that does not allow "experimenting" with statistics.
I will give you an example: in one of my last projects it was not possible to do proper performance optimization because only database administrators were allowed to add or remove statistics, and each required change took at least one day. This effectively forced me to skip statistics considerations completely and rely on query rewriting only.
Another mistake I see quite often is that people look only at the date of the last recollection and decide based on this date alone whether statistics are stale.
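A better check also compares the row count captured at the last collection with the current table cardinality. Here is a minimal sketch, assuming a Teradata 14.x or later release where the DBC.StatsV view exposes RowCount and LastCollectTimestamp; view and column names may differ on your release, and the database and table names are hypothetical:
-- Row count captured when the statistics were last collected:
SELECT ColumnName, LastCollectTimestamp, RowCount
FROM DBC.StatsV
WHERE DatabaseName = 'My_Database'
  AND TableName = 'The_Table';
-- Current table cardinality, for comparison:
SELECT COUNT(*) FROM My_Database.The_Table;
If the two row counts differ significantly (for example, by more than 10%), the statistics are most likely stale, even if the last collection date looks recent.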