Teradata Statistics for High-Performance Queries
Today I attended an excellent presentation about the Teradata statistics improvements, presented by Thomas Mechtler, who is a very experienced Senior Consultant at Teradata Austria. I carved out some of the major points I wanted to share with you in this article.
In the past, Teradata statistics maintenance was as difficult as it is important, easily but this changed a lot with the introduction of Teradata 14.10, it's not difficult anymore (but still important):
New database features and tools simplify the statistics collection process, making the life of the database administrator easier
Some of the central questions about statistics usually raised are:
- Which statistics should be collected?
- Where in the ETL/ELT process should they be collected?
- How often should they be collected?
- Which statistics are worthless, because they are never used?
The problem with the above questions is that until Teradata 13.10 they couldn't be answered quickly.
For example, when we had to find out the best place in the ETL process, there was no one-size-fits-all answer:
What for instance, if several transformation steps are executing one after the other, populating the same target table, which itself is used again by each transformation step? Would you collect statistics over and over again after each transformation step to support each step with the correct statistics (accepting the additional resource consumption)? Would you wait until the last transformation process is finished (taking possible sub-optimal executions plan)?
Similar is the situation when it comes to the frequency of statistics recollection:
While some of your tables might change so fast that a daily collection is required, there might be other tables requiring only a weekly recollection. How would you handle this situation? By running two recollection processes, one weekly and one daily? Or would you accept inaccurate statistics and only collect statistics once a week? Or even take the resource consumption overhead, by running a daily recollection process?
Whatever was your decision: It came with advantages and disadvantages, namely a tradeoff between resource usage and accuracy of estimations.
Luckily, the situation improved a lot with Teradata 14.10:
Teradata Statistics Improvements on 14.10
Teradata 14.10 introduces several new features that solve, or at least ease, above mentioned problems. These improvements help us to:
- Identify unused statistics
- Skip the collection of statistical data if data demographics are unchanged or are below a certain threshold
- Identify statistics which are missing but would be needed by the Optimizer to make a better execution plan
Before we can use the additional functionality, some technical requirements have to be fulfilled. If we want to skip statistics recollection for unchanged data demographics, object usage counts (OUC) have to be activated:
BEGIN QUERY LOGGING WITH USECOUNT ON "TheDatabase";
The previous statement turns on object usage count logging for all database objects in the database “TheDatabase.” Statistics are database objects like any other object, such as tables, views, etc. This is how it works.
As soon as OUC is activated, each object access (or what's interesting for us: access to a statistic) is counted in the DBC table DBC.ObjectUsage
DBC.ObjectUsage and some other DBC tables offer some valuable information which helps us to identify unused statistics:
When was OUC activated for the considered statistics?
When were the statistics added?
When were the statistics used the last time by the Optimizer (probably the most important question)?
I showed you how to detect unused statistics, but what about the ones who are missing (Those statistics, the Optimizer would need to build its execution plan)?
Two additional logging options have been implemented in Teradata 14.10, giving us exactly this information:
BEGIN QUERY LOGGING WITH XMLPLAN, STATSUSAGE ON TheDatabase;
XMLPLAN gives you information about statistics the optimizer would have needed, but they were not available. This information is available for each step of your query. Further, it outputs the number of estimated rows and the number of actual rows. Unfortunately, as the name suggests, it is the output in XML format and requires some coding and parsing.
STATSUSAGE is easier to use as a regular table but not as detailed as the XMLPLAN output. Still, it's much better for SQL tuning than the traditional approach, which is:
HELPSTATS ON FOR SESSION;
Is it time to recollect? Lean back and let Teradata decide for you!
While the information we mentioned in the previous chapter helps us to find unused statistics, in this section we are covering the remaining problems we mentioned at the begin of this article:
When and how often should you recollect statistics?
I have good news: Teradata 14.10 implements something like an “autopilot mode.” Statistics recollection can be triggered by predefined measures and threshold levels – but only if OUC is enabled!
Teradata makes use of the UDI counts (Updates, Deletes, Inserts), which are part of OUC, and historical statistics histograms to decide if statistics are stale. UDI counts count how many rows have been deleted or inserted, and it collects information about updates on the column level.
As a consequence, you can execute the collect statistics statements as often as you like, without overloading your system:
The actual collection process is only activated if Teradata is convinced that the statistics are stale!
No more compromises. The two central questions “When should I recollect? How often should I recollect?” just faded away…
In “autopilot mode” (default on Teradata 14.10 with UOC enabled), the impact of design flaws in your statistics recollections process is being minimized.
It's important that you don't drop statistics before recollecting them, as you would lose all historical statistic information.
Just by the way, threshold levels can be manually defined by you, but most probably you should let Teradata do this job for you:
COLLECT STATISTICS USING THRESHOLD 14 DAYS AND THRESHOLD 5 PERCENT COLUMN (TheColumn) ON TheDatabase;
The above example would trigger recollection whenever 14 days past by and table cardinality changed by at least 5% (again, the Optimizer makes use of UDI and historical statistics histograms for its analysis).
A few last words: Most of you will be confronted with a historically grown statistics framework. With Teradata 14.10 it's time to replace it and to make use of the new features. Don't throw away what you have, just think about how the new process could supersede your existing one. Fading out your current solution may take some time, but it's worth it.