Introduction to Teradata Statistics Design for High Performance
I attended an outstanding presentation on the Teradata statistics enhancements given by Thomas Mechtler, an extremely knowledgeable Senior Consultant at Teradata Austria. Below are some key takeaways I’d like to share.
Maintaining Teradata statistics was once a challenging and pivotal task, but with the advent of Teradata 14.10, it has become significantly less complicated while remaining equally important.
The latest database features and tools simplify the process of collecting statistics, easing the database administrator’s life.
Commonly asked questions about statistics include:
- Which statistics should be collected?
- Where in the ETL/ELT process should they be collected?
- How often should they be collected?
- Which statistics are worthless because they are never used?
Before Teradata 13.10, there was no straightforward way to answer these questions promptly.
There was no one-size-fits-all answer to where statistics collection belonged in the ETL/ELT process.
What is the best course of action if multiple transformation steps are executed consecutively using the same target table? Should statistics be gathered repeatedly after each step to ensure accurate statistics for each step, even if it uses more resources? Or should one wait until all steps are complete, even if it means using a sub-optimal execution plan?
The same applies to the frequency of collecting statistics.
Tables may require different collection frequencies, some daily and others weekly. How do you manage this? Do you run two separate collection processes, one for weekly and one for daily? Or do you settle for imprecise data by collecting statistics only once a week? Alternatively, do you bear the cost of resource consumption by running a daily collection process?
Your decision has advantages and disadvantages, resulting in a tradeoff between resource usage and estimation accuracy.
Fortunately, the situation significantly improved with Teradata 14.10.
Teradata Statistics Improvements in 14.10
Teradata 14.10 introduces new features that ease or solve the aforementioned problems. They help us to:
- Identify unused statistics
- Skip the collection of statistics if data demographics are unchanged or have changed less than a defined threshold
- Identify missing statistics that the Optimizer would need to create a better execution plan
Before utilizing the extra functionality, certain technical requirements must be met. To avoid collecting statistics for unaltered data demographics, it is necessary to enable object usage counts (OUC).
BEGIN QUERY LOGGING WITH USECOUNT ON "TheDatabase";
This statement enables object usage count logging for all database objects in “TheDatabase”, including statistics, which are database objects just like tables, views, and others.
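To verify that the rule is active, you can inspect the DBQL rules. A minimal sketch; DBC.DBQLRulesV is the standard dictionary view for query logging rules, but the available columns vary slightly by release:
-- Show all active query logging rules, including the USECOUNT rule just created
SELECT * FROM DBC.DBQLRulesV;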
After activating OUC, every access to an object (in particular, every use of a statistic) is recorded in the table DBC.ObjectUsage.
DBC.ObjectUsage and other relevant DBC tables provide the information needed to identify unused statistics:
- When was OUC activated for the statistics under consideration? DBC.DBQLRuleTbl.TimeCreated
- When were the statistics added? DBC.StatsTbl.CreateTimeStamp
- When did the Optimizer last use the statistics (probably the most important question)? DBC.ObjectUsage.LastAccessTimeStamp
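A minimal sketch of how these sources can be used. DBC.StatsV is the standard dictionary view on top of DBC.StatsTbl; the 90-day cutoff is an arbitrary example, and resolving DBC.ObjectUsage rows to statistic names runs over internal IDs, so adapt the details to your environment:
-- All statistics defined in the database, with their creation timestamp
SELECT DatabaseName, TableName, CreateTimeStamp
FROM DBC.StatsV
WHERE DatabaseName = 'TheDatabase';
-- Objects (including statistics) not accessed for more than 90 days since OUC
-- activation are removal candidates; mapping the internal IDs back to statistic
-- names requires a join against DBC.StatsTbl
SELECT *
FROM DBC.ObjectUsage
WHERE CAST(LastAccessTimeStamp AS DATE) < CURRENT_DATE - 90;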
We have now seen how to detect unused statistics, but what about the missing statistics the Optimizer would need to create a better execution plan?
Teradata 14.10 now includes two extra logging options that provide precise information.
BEGIN QUERY LOGGING WITH XMLPLAN, STATSUSAGE ON TheDatabase;
XMLPLAN logs, for each query step, which statistics were not available to the Optimizer, and it also shows estimated versus actual row counts. However, the output is only available in XML format and requires parsing.
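A minimal sketch for retrieving the logged XMLPLAN documents, assuming the standard DBQL XML table DBC.DBQLXMLTbl and its XMLTextInfo column (verify the names in the DBC documentation of your release):
-- Fetch the XML plan documents for parsing in an external tool
SELECT QueryID, CollectTimeStamp, XMLTextInfo
FROM DBC.DBQLXMLTbl
ORDER BY CollectTimeStamp DESC;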
STATSUSAGE is more user-friendly, albeit less comprehensive than the XMLPLAN data. Nevertheless, both are a superior option for optimizing SQL compared to the conventional method:
DIAGNOSTIC HELPSTATS ON FOR SESSION;
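With the diagnostic switched on, every EXPLAIN ends with a list of COLLECT STATISTICS recommendations. A quick illustration, using hypothetical object names:
-- The plan output for this query now closes with the statistics
-- the Optimizer would have liked to have available
EXPLAIN
SELECT *
FROM TheDatabase.TheTable
WHERE TheColumn = 1;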
Is it time to recollect? Lean back and let Teradata decide for you!
The preceding section showed how to identify unused statistics. This section addresses the remaining questions raised at the beginning of this article:
How frequently should statistics be collected?
Good news: Teradata 14.10 has an “autopilot mode.” Predefined measures and threshold levels can trigger statistics recollection, but OUC must be enabled.
Teradata uses the UDI counts (Updates, Deletes, Inserts), which are part of OUC, together with historical statistics histograms to decide whether statistics are stale. UDI counts track how many rows have been inserted or deleted, and they record update activity at the column level.
You can submit COLLECT STATISTICS statements as frequently as you like without burdening your system.
The collection process is initiated only when Teradata determines that the statistics are stale.
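A sketch of this fire-and-forget approach, using the illustrative names from above: the statement below can be scheduled daily; it refreshes all statistics defined on the table, and Teradata skips every statistic whose thresholds are not exceeded.
-- Recollect all statistics defined on the table; with thresholds in place,
-- Teradata decides per statistic whether recollection is actually necessary
COLLECT STATISTICS ON TheDatabase.TheTable;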
No more compromises. The two pivotal questions, “When should I recollect? How frequently should I recollect?”, have disappeared…
When operating in “autopilot mode” on Teradata 14.10 with OUC enabled, any negative effects resulting from design flaws in your statistics collection process are minimized.
Avoid dropping statistics before recollecting them; otherwise, the historical data the threshold logic depends on is lost.
Threshold levels can be manually defined. However, it is recommended to let Teradata perform this task for you.
COLLECT STATISTICS USING THRESHOLD 14 DAYS AND THRESHOLD 5 PERCENT COLUMN (TheColumn) ON TheDatabase.TheTable;
This example triggers recollection once 14 days have passed and the table’s cardinality has changed by at least 5%. The Optimizer relies on UDI counts and historical statistics histograms to conduct its analysis.
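If you prefer to leave the threshold decision entirely to Teradata, as recommended above, a system-determined threshold can be specified instead (a sketch based on the 14.10 syntax; verify it on your release):
-- Let Teradata derive the recollection threshold itself
COLLECT STATISTICS USING SYSTEM THRESHOLD COLUMN (TheColumn) ON TheDatabase.TheTable;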
Final thoughts: Many of you will be working with a statistics framework that has grown over time. When you upgrade to Teradata 14.10, take advantage of the new features: keep what works in your current setup, but consider how the new approach can improve it. Transitioning away from your current solution may require effort, but it is worthwhile.
Hey Roland, thanks for the article!
Do you think it is possible to view use counts for columns of particular tables and databases using DBC.ObjectUsage?
I would like to use it to determine the most frequently accessed columns to improve query performance. It might be useful for reviewing PI/SI choices.
Does it count SELECT statements or UDI only? Are the counts kept at the user level or the system level?
It could be very useful. I might test it in a couple of days.
Thank you!