Teradata SQL Tuning Evergreens

 

  1. Ensure Completeness and Correctness of Teradata Statistics

The most important optimization task is to provide the Teradata Optimizer with all the statistics it needs. Statistics collection should always come first, before any further optimization considerations.

Which kind of statistics (full statistics, random-AMP sampling, or sample statistics) is required on which columns depends primarily on the data demography.

Random-AMP sampling is usually sufficient for unskewed data, provided the table is large enough: it should contain significantly more rows than there are AMPs in the system.

If the data are skewed, there is a high probability that the sample is from an AMP that is not representative of the data. Skewed data leads to an overestimation or underestimation of the number of rows.

Good candidates for Random-AMP sampling are unique indices, i.e., UPI and USI.

However, random-AMP sampling only takes place on indexed columns. Therefore, we must always collect full statistics on non-indexed columns used in WHERE conditions.

Full statistics are also appropriate for skewed column values and columns with a small number of distinct values, such as NUPI and NUSI.
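
As an illustration with a made-up Sales table: the unique primary index needs no collected statistics because random-AMP sampling is good enough, while the skewed NUPI and a non-indexed filter column with few distinct values get full statistics:

COLLECT STATISTICS ON Sales COLUMN (customer_id);   -- skewed NUPI
COLLECT STATISTICS ON Sales COLUMN (sale_type_cd);  -- WHERE-clause column with few distinct values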

The Teradata Optimizer also has unique requirements regarding statistics when it comes to PPI tables.

Statistics should be collected here in any case:

Column PARTITION

Statistics on column “PARTITION” tell the optimizer how many partitions are empty.

PARTITION + PI column

These statistics are required if the partitioning columns are not part of the primary index. In this case, the same primary index value can exist in different partitions.

Statistics on column “PARTITION” plus the primary index columns help the optimizer estimate the cost of sliding-window and rowkey-based merge joins, and of dynamic partition elimination.
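
A minimal sketch for a hypothetical PPI table Sales_PPI whose primary index column (sale_id) is not part of the partitioning expression:

COLLECT STATISTICS ON Sales_PPI COLUMN (PARTITION);
COLLECT STATISTICS ON Sales_PPI COLUMN (PARTITION, sale_id);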

We can issue the following statement to determine which statistics the optimizer is missing:

DIAGNOSTIC HELPSTATS ON FOR SESSION;

The above statement appends a list of missing statistics to the end of the execution plan (EXPLAIN output), together with the optimizer’s estimate of their importance (confidence levels).

By gradually adding these statistics, you can test their influence on the improvement of the execution plan.
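
The typical workflow looks like this (the table and filter column below are placeholders): switch the diagnostic on, EXPLAIN the statement, read the recommended COLLECT STATISTICS commands at the end of the plan, and switch the diagnostic off again:

DIAGNOSTIC HELPSTATS ON FOR SESSION;

EXPLAIN
SELECT *
FROM Sales
WHERE sale_type_cd = 'RETAIL';

DIAGNOSTIC HELPSTATS NOT ON FOR SESSION;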

There are several ways to find stale statistics. The easiest is to split the SQL statement and test each partial statement individually, comparing the estimated number of rows (as shown in the EXPLAIN output) with the actual number of rows returned by the query.

The approach described above is particularly suitable if the entire SQL statement does not finish in a reasonable time.

Here’s an example:

SELECT t01.* FROM <Tablename1> t01
INNER JOIN
<Tablename2> t02
ON t01.PK = t02.PK
WHERE t01.<Columnname> = 1 AND t02.<Columnname> = 2;

The above query can be divided into two parts for testing:

SELECT * FROM <Tablename1> WHERE <Columnname> = 1;

and

SELECT * FROM <Tablename2> WHERE <Columnname> = 2;

If you execute both sub-queries, and the number of rows returned differs significantly from the estimate in EXPLAIN, the statistics may be obsolete.

  2. The Primary Index (PI) Choice

Design your queries so that the primary index can be used in the joins.

To do this, all join columns must be contained in the primary index. If only one column of the primary index is missing in the join condition, the result is a different row hash!

However, the join condition can contain additional columns. These are then applied as residual conditions.

To execute a join between two tables, the rows of both tables must exist on a common AMP.

This is exactly the case if they have the same primary index. In this case, the optimizer can use a join strategy that does not involve redistributing rows.

If for some reason you cannot change the primary indexes, or if you only need a particular primary index for very specific SQL statements, use a volatile table or a global temporary table with the same structure and content as the original table, but with a different primary index.
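
A minimal sketch, assuming a hypothetical Sales table that we want to redistribute on the join column customer_id for a handful of statements:

CREATE VOLATILE TABLE Sales_v AS
(
   SELECT * FROM Sales
) WITH DATA
PRIMARY INDEX (customer_id)
ON COMMIT PRESERVE ROWS;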

When choosing the primary index, it is also important to distribute the rows of the tables as evenly as possible over all AMPs.

Only then is it guaranteed that all AMPs involved in executing the plan start and finish their work at the same time. This is what parallel data processing is all about.

There is something else to consider when selecting the primary index: Since every change to a value of a primary index column means a new row hash, rows must be redistributed.

This may still be acceptable if this happens now and then and only on a few rows. However, if the primary index is very volatile, this will have a negative impact on performance.

  3. Teradata Indexing & Partitioning

Using indexes or partitioning is another way to improve query performance.

While secondary indexes in Teradata are stored in a sub-table and therefore require extra space, partitioning is just another way to store the rows on the disk.

Whether partitioning or indexing is the better choice depends on the workload. Sometimes a combination of both makes sense (e.g., creating an index on a partitioned table).

Tactical queries usually access just one or a few rows and are often well served by a USI. A join index that is a copy of the base table with only the primary index changed can also be helpful here.
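
A minimal sketch of such a single-table join index (table and column names are made up): it replicates the base table under a different primary index so that tactical lookups on customer_nr get a direct access path:

CREATE JOIN INDEX Sales_JI AS
SELECT sale_id, customer_nr, sale_dt, amount
FROM Sales
PRIMARY INDEX (customer_nr);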

Partitioned tables are often used for strategic queries in which a range of data (for example, the sales for the current year) is added up.
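
A minimal sketch of such a table (names and date range are made up), partitioned by day so that a query restricted to the current year only has to scan the matching partitions:

CREATE TABLE Sales_PPI
(
   sale_id  INTEGER NOT NULL,
   sale_dt  DATE    NOT NULL,
   amount   DECIMAL(12,2)
)
PRIMARY INDEX (sale_id)
PARTITION BY RANGE_N (sale_dt BETWEEN DATE '2014-01-01'
                              AND     DATE '2015-12-31'
                              EACH INTERVAL '1' DAY);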

When you work with indexing techniques, you need to keep an eye on the entire data warehouse architecture and decide whether your solution fits.

Above all, it is important to check the effect on the ETL load process. While partitioning has practically no restrictions, secondary indexes and join indexes have various disadvantages.

On the one hand, certain load tools such as FastLoad require secondary indexes and join indexes to be removed before loading.

On the other hand, the index sub-tables have to be maintained: every INSERT, DELETE, and UPDATE statement must maintain not only the base table but also the index sub-table.

If potentially useful indexes are not used by the optimizer and do not help the overall PDM design, drop them immediately. You’re wasting space and resources.

You can find more details about indexing here.

  4. Query Rewriting

The performance of a query can often be improved by rewriting the query.

Personally, I prefer to consider an SQL statement as a black box and to limit optimizations to technical methods first.

Here are a few examples:

  • EXISTS instead of IN (see the sketch after this list)
  • Modularizing a large SQL statement
  • UNION ALL instead of UNION
  • DISTINCT instead of GROUP BY
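
As an illustration of the first rewrite, with made-up tables Customer and Orders, an IN subquery can be expressed as a correlated EXISTS:

-- IN variant
SELECT c.*
FROM Customer c
WHERE c.customer_id IN (SELECT o.customer_id FROM Orders o);

-- equivalent EXISTS variant
SELECT c.*
FROM Customer c
WHERE EXISTS (SELECT 1 FROM Orders o WHERE o.customer_id = c.customer_id);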

Because I do not have to understand the content and business logic of a query, I do not need to contact its author. Purely technical optimizations are usually also less risky.

Only when I reach the limits of purely technical optimization do I begin to deal with business logic. Understanding the business logic often reveals additional optimization possibilities.

Rewriting of queries often solves performance problems, even when all other techniques have failed.

  5. Real-Time Monitoring

Watch the query in real time. Monitoring a query in Viewpoint at runtime helps to identify the critical steps.

In Viewpoint it makes sense to look for the following steps:

  • Steps that run relatively long and consume relatively many resources.
  • Steps that run skewed. Especially if the skew factor reaches 99%, all the work is done by only one AMP.

As a next step, we try to find out the cause. Here are a few ideas:

  • A poor primary index on the base tables? (a skew check sketch follows this list)
  • Does the redistribution of rows cause skew?
  • Are large amounts of data redistributed because of poor join preparation caused by missing or stale statistics?
  • Are there many hash collisions during the execution of the INSERT statement?
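
To check whether a base table itself is skewed, a query like the following helps (a sketch, assuming SELECT access to DBC.TableSizeV; 'MyTable' is a placeholder):

SELECT DataBaseName,
       TableName,
       CAST((1 - AVG(CurrentPerm) / NULLIFZERO(MAX(CurrentPerm))) * 100
            AS DECIMAL(5,2)) AS SkewFactorPct
FROM DBC.TableSizeV
WHERE TableName = 'MyTable'
GROUP BY 1, 2;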

Obsolete and missing statistics typically lead to incorrect decisions in join preparation (copying a table to all AMPs instead of rehashing) and to the use of the wrong join method (for example, a product join instead of a merge join).

  6. Comparison of Resource Usage

Another important task in performance optimization is measuring resource consumption before and after optimization.

Query runtimes are misleading because they can be affected by concurrent workload over which we have no control.

As performance tuners, we prefer absolute key figures that do not change with the concurrent workload.

Here is an example of a query that gives you a detailed insight into how efficiently your query ran. To execute this query, you need SELECT access to the table DBC.DBQLOGTBL.

Set a different QUERYBAND for each version of the query that you execute:

SET QUERY_BAND = 'Version=1;' FOR SESSION;

SELECT
   AMPCPUTIME,
   (FIRSTRESPTIME - STARTTIME DAY(2) TO SECOND(6)) AS RUNTIME,
   SPOOLUSAGE/1024**3 AS SPOOL_IN_GB,
   CAST(100 - ((AMPCPUTIME/(HASHAMP()+1))*100/NULLIFZERO(MAXAMPCPUTIME)) AS INTEGER) AS CPU_SKEW,
   MAXAMPCPUTIME*(HASHAMP()+1) AS CPU_IMPACT,
   AMPCPUTIME*1000/NULLIFZERO(TOTALIOCOUNT) AS LHR
FROM
   DBC.DBQLOGTBL
WHERE
   QUERYBAND = 'Version=1;';

The query will return:

  • The total CPU Usage
  • The Spool Space needed
  • The LHR (ratio between CPU and IO usage)
  • The CPU Skew
  • The Skew Impact on the CPU

Our goal is to cut total CPU usage, consumed spool space, and CPU skew. Furthermore, the LHR is optimally around 1.00.
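
To compare a rewritten version against the original, run it under a different query band and aggregate the key figures per query band. A sketch, assuming both runs were logged to DBC.DBQLOGTBL:

SET QUERY_BAND = 'Version=2;' FOR SESSION;
-- run the rewritten query here, then compare:

SELECT QUERYBAND,
       SUM(AMPCPUTIME)         AS TOTAL_CPU,
       SUM(SPOOLUSAGE)/1024**3 AS SPOOL_IN_GB
FROM DBC.DBQLOGTBL
WHERE QUERYBAND IN ('Version=1;', 'Version=2;')
GROUP BY 1;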


9 COMMENTS

  1. Hi Roland,
    Is it fine to collect stats on all columns recommended by DIAGNOSTIC command ?
    Is there any specific way to know as to which stats recommendations should be implemented and which ones to skip ?
    I generally collect stats on single column stats recommendations and skip the multi-column stats. But there have been instances where multi-column stats recommendations helped after implementation.

    Regards,
    Dhruba

    • Not all recommended statistics are needed. I follow a similar approach. Especially when it comes to multi-column statistics, we have to consider that their applicability is limited compared to single-column stats. While the optimizer can use the leading part of a statistic on columns (a,b), i.e. the “a”, it cannot use it for a query which, for example, has a WHERE condition only on column “b”. Of course, the ideal scenario for the statistic on columns (a,b) is a WHERE condition on both columns.
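
      To illustrate with made-up columns a and b on a hypothetical table MyTable:

      COLLECT STATISTICS ON MyTable COLUMN (a, b);

      -- usable for:      WHERE a = 1 AND b = 2   (both columns)
      -- usable for:      WHERE a = 1             (leading column only)
      -- not usable for:  WHERE b = 2             (non-leading column only)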

      It’s similar to statistics on functions, such as on CASE statements. They are only useful for a very specific kind of query which matches exactly this function.

  2. INSERT INTO eitopsdb.GTT_srv_ft_data_grp_nbr_rk

    SELECT BASE.srv_accs_id
    ,BSS.data_grp_nbr
    FROM Eitopsdb.srvft_blstmt_usgvol_fltavg_rk BASE INNER JOIN
    EDWVIEWS.bl_stmnt_sub BSS
    ON
    BASE.srv_accs_id = BSS.srv_accs_id
    -- WHERE bl_cyc_dt BETWEEN DATE -31 -- ensure one month ago
    -- AND DATE -1 -- Yesterday
    QUALIFY 1 = ROW_NUMBER() OVER
    (PARTITION BY BSS.srv_accs_id
    ORDER BY BSS.bl_cyc_dt DESC, BSS.data_grp_nbr DESC)

    Could anybody help in tuning this query??

    It is failing with below error-

    *** Failure 3156 Request aborted by TDWM. Exception criteria exceeded: Spool Usage.
    Statement# 1, Info =0

    • Hi – The duplicate filtering should be taken care of before the rows participate in the join operation. Share the explain plan of this query for more detail.
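
      A possible rewrite along these lines (an untested sketch; it is equivalent as long as BASE has at most one row per srv_accs_id) filters the duplicates in a derived table first and joins the reduced row set afterwards, which should lower the spool usage:

      INSERT INTO eitopsdb.GTT_srv_ft_data_grp_nbr_rk
      SELECT BASE.srv_accs_id
            ,BSS.data_grp_nbr
      FROM Eitopsdb.srvft_blstmt_usgvol_fltavg_rk BASE
      INNER JOIN
      (
         SELECT srv_accs_id
               ,data_grp_nbr
         FROM EDWVIEWS.bl_stmnt_sub
         QUALIFY 1 = ROW_NUMBER() OVER
                 (PARTITION BY srv_accs_id
                  ORDER BY bl_cyc_dt DESC, data_grp_nbr DESC)
      ) BSS
      ON BASE.srv_accs_id = BSS.srv_accs_id;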

  3. This is a question which can’t be answered easily. New releases of Teradata have some helpful features like statistics extrapolation and the comparison of random-AMP sampling against collected statistics. This may help in reducing recollection intervals.

    There are different approaches which could be taken. The refresh interval definitely depends on how fast the data changes. Usually I sneak some code into central components like the historization process. You could, for example, use the ACTIVITYCOUNT variable in BTEQ to get the number of new rows added to a table. This may be helpful in estimating the table growth, and this insight can be used for a timely recollection of statistics without any overhead.

    Statistics collection is always a tradeoff between keeping the system performing and the overhead caused by the collection process.

  4. Roland,

    Should we collect stats every time a table gets refreshed (either inserts, updates, or both)? Let’s say we have some 500 tables in the DWH which get refreshed every day; in this case, do we have to collect stats for all 500 tables every day?

  5. Yes, partitioning definitely should be part of the golden rules.

    Regarding statistics, I always struggle with whether to show the DIAGNOSTIC HELPSTATS statement. On the one hand it is a fast and comfortable way to analyze the statistics landscape; on the other hand, I have seen many people start to blindly add all of its recommendations after I showed them this possibility.

  6. From a pedagogical point of view, I would actually rank “statistics” at the end of this list.

    I consistently see that statistics tend to be over-collected due to staff’s “low confidence” in pretty much all the other points you mention here. The magic of the DIAGNOSTIC statement often shifts attention away from more fundamental design problems, which statistics are only meant to complement.

    Great post!
