
Teradata SQL Tuning – 6 Golden Rules you must never forget!

1. Ensuring the Completeness and Correctness of Statistics

Teradata SQL Tuning begins by providing the optimizer with the statistics it needs. This must always be done as a first step. Statistics influence the execution plan. It is not only important to define all required statistics, but also to keep them as up-to-date as possible.

There is a simple reason why I put statistics first: Often performance problems are solved simply by changing the execution plan.

Which columns require which type of statistics (full statistics collection, random-AMP samples, or sample statistics) depends primarily on the data demography.

Random-AMP sampling

Random AMP sampling is usually sufficient for columns with many different values. The table should contain significantly more rows than there are AMPs in the system.
The following applies to random-AMP sampling: The more distinct the column values are, the better the estimate will be.

If the column values are skewed, there is a risk that the random-AMP sample will be taken from an AMP that is not representative of the data. Skewed data leads to overestimation or underestimation of the number of rows.

Good candidates for Random-AMP sampling are unique indices, i.e., UPI and USI.

Random-AMP sampling only takes place on indexed columns.

Therefore, you should always collect statistics on non-indexed columns that are used in WHERE conditions. Whether full or sample statistics are collected depends on how unique the column values are.

Fully collected Statistics

Fully collected statistics are required for skewed column values and for columns with a small number of distinct values (NUPI and NUSI).
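
As a sketch, both collection variants might look like this (table and column names are placeholders):

COLLECT STATISTICS COLUMN (Customer_Id) ON Sales;           -- full statistics, e.g., for a skewed NUPI column
COLLECT STATISTICS USING SAMPLE COLUMN (Order_Nr) ON Sales; -- sample statistics for a nearly unique column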

Statistics for PPI Tables

The Teradata Optimizer has additional statistics requirements for PPI tables.

The following statistics should additionally be collected on PPI tables:

Dummy Column “PARTITION”

Statistics on dummy column “PARTITION” tell the optimizer how many partitions are empty and how many rows each partition contains.

Dummy column “PARTITION” + PI column

These statistics are required if the partitioning columns are not part of the primary index. In this case, the same primary index value can exist in different partitions. Statistics on dummy column “PARTITION” + PI allow the optimizer to estimate the cost of the sliding window and rowkey based merge join and of dynamic partition elimination steps.
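
For a PPI table, the corresponding statements might look like this (table and column names are placeholders):

COLLECT STATISTICS COLUMN (PARTITION) ON Sales;
COLLECT STATISTICS COLUMN (PARTITION, Customer_Id) ON Sales; -- dummy column PARTITION plus the PI column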

The following statement can be used to determine which statistics the optimizer would additionally like to have for a SQL statement:

DIAGNOSTIC HELPSTATS ON FOR SESSION;

This statement displays a list of suggested statistics at the end of the Execution Plan (given by the EXPLAIN statement) and the Optimizer’s opinion of their value (confidence levels).

By gradually adding these statistics, you can test their influence on the execution plan.
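
A minimal workflow sketch (the SELECT is a placeholder for the query you are tuning):

DIAGNOSTIC HELPSTATS ON FOR SESSION;

EXPLAIN
SELECT * FROM Sales WHERE Sales_Date = DATE '2024-01-01'; -- suggested statistics appear at the end of the plan

DIAGNOSTIC HELPSTATS NOT ON FOR SESSION; -- switch the recommendations off again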

Identify Stale Statistics

There are several ways to identify stale statistics. The easiest is to split the SQL statement into partial statements and test each one individually: compare the estimated number of rows (as shown in the EXPLAIN output) with the actual number of rows returned by the partial query.

The above-described approach is particularly suitable if the entire SQL statement does not execute in a reasonable time.

Here’s an example:

SELECT t01.*
FROM
   <Tablename1> t01
INNER JOIN
   <Tablename2> t02
ON
   t01.PK = t02.PK
WHERE
   t01.<Column name> = 1 AND t02.<Column name> = 2;

The above query can be divided into two parts for testing:

SELECT * FROM <Tablename1> WHERE <Column name> = 1;
SELECT * FROM <Tablename2> WHERE <Column name> = 2;

If you execute both sub-queries, and the number of rows returned differs significantly from the estimate in EXPLAIN, the statistics may be obsolete.
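
Another quick check is to look at the collection timestamps. Assuming a release of Teradata 14.0 or later, the view DBC.StatsV shows when each statistic was last refreshed (the database name is a placeholder):

SELECT DatabaseName, TableName, ColumnName, LastCollectTimeStamp
FROM DBC.StatsV
WHERE DatabaseName = '<Databasename>'
ORDER BY LastCollectTimeStamp;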

2. Teradata SQL Tuning and the Primary Index Choice

Teradata SQL Tuning is not complete without choosing the best possible Primary Index. Design your queries so that the primary index can be used in the joins.

All join columns must be part of the primary index. If even one column of the primary index is missing from the join condition, the result is a different row hash (the order in which the columns are defined in the primary index does not matter).

Still, the join condition can contain additional columns. These are then applied as residual conditions after locating the rows via row hash.

To execute a join between two tables, the rows of both tables must be co-located on the same AMP.

This is true when they have the same primary index. In this case, the optimizer can use a join strategy that requires no redistribution of rows.

If for some reason you cannot change the primary indexes, or if you need a specific primary index for a particular SQL statement, create a volatile table (or a global temporary table) with the same structure and content as the original table, but with a different primary index.
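
A minimal sketch of this approach (table, column, and index names are placeholders):

CREATE VOLATILE TABLE Sales_Tmp AS
(
   SELECT * FROM Sales
) WITH DATA
PRIMARY INDEX (Customer_Id)  -- the primary index needed for the join
ON COMMIT PRESERVE ROWS;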

An evenly distributed, join-friendly primary index is the only way to ensure that all AMPs start and finish their work at about the same time. This is exactly what parallel data processing is all about.

There is something else to consider when selecting the primary index: every update of a primary index column produces a new row hash, so the affected rows must be redistributed.
This can still be acceptable if it happens only occasionally and affects only a few rows. However, if the primary index columns are frequently updated, this has a negative effect on performance.

3. Teradata SQL Tuning with Indexing & Partitioning

Using indexes or partitioning is another way to improve query performance.

While secondary indexes in Teradata are stored in a sub-table and therefore require extra space, partitioning is just another way to structure the rows on the mass storage device.

Whether partitioning or indexing is the better option depends on the workload type. Sometimes a combination of both is the best approach (i.e., creating an index on a partitioned table).

Tactical queries normally access one or a few data rows. The USI is therefore often an optimal choice if the access can't be covered by a primary index. A join index that is a copy of the base table with a different primary index can also be helpful here.
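
Such a join index might look like the following sketch (table, column, and index names are placeholders); it re-hashes a copy of the needed columns on the tactical access column:

CREATE JOIN INDEX Sales_JI AS
SELECT Sales_Id, Customer_Id, Sales_Date
FROM Sales
PRIMARY INDEX (Sales_Id);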

Partitioned tables are often used for strategic queries in which a series of data (for example, the sales for the current year) is aggregated.
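
A month-based partitioning might look like this sketch (all names and the date range are placeholders):

CREATE TABLE Sales
(
   Sales_Id    INTEGER NOT NULL,
   Customer_Id INTEGER NOT NULL,
   Sales_Date  DATE NOT NULL
)
PRIMARY INDEX (Customer_Id)
PARTITION BY RANGE_N (Sales_Date BETWEEN DATE '2020-01-01' AND DATE '2025-12-31' EACH INTERVAL '1' MONTH);

A query that restricts Sales_Date to the current year then reads only the relevant partitions (partition elimination) instead of the whole table.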

Disadvantages of Indexing

When we work with indexing techniques, we need to keep an eye on the entire data warehouse architecture and decide whether our solution fits. Indexes can have a negative impact on the ETL process for several reasons.

Loading tools such as Fastload require secondary indexes and join indexes to be removed before loading.
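
A typical pattern (index and table names are placeholders) is to drop such indexes before the load and recreate them afterwards:

DROP INDEX Sales_NUSI ON Sales;
-- ... run the FastLoad job ...
CREATE INDEX Sales_NUSI (Sales_Date) ON Sales;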

The index sub-tables have to be maintained by Teradata: each INSERT, DELETE, and UPDATE statement has to change not only the base table but also the index sub-table.

If the optimizer does not use a potentially useful index, and it serves no purpose in the overall PDM design, drop it immediately. You're wasting space and resources.

4. Query Rewriting

The performance of a query can often be improved by rewriting the query.

Personally, I prefer to consider an SQL statement as a black box and to limit optimizations to technical methods first.

Here are a few examples:

  • EXISTS instead of IN
  • Modularizing a large SQL statement
  • UNION ALL instead of UNION
  • DISTINCT instead of GROUP BY
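
As an illustration, the first rewrite might look like this (table and column names are placeholders; both variants return the same result):

-- IN version:
SELECT *
FROM Sales s
WHERE s.Customer_Id IN (SELECT r.Customer_Id FROM Returns r);

-- EXISTS version:
SELECT *
FROM Sales s
WHERE EXISTS (SELECT 1 FROM Returns r WHERE r.Customer_Id = s.Customer_Id);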

Because I do not need to understand the content and business logic of a query, I do not have to contact its author. Purely technical optimizations are usually not that risky.

Only when I reach the limits of purely technical optimization do I begin to deal with business logic. Understanding business logic often opens up additional optimization opportunities.

Rewriting of queries often solves performance problems, even when all other techniques have failed.

5. Teradata SQL Tuning with Real-Time Monitoring

Teradata SQL tuning requires watching the query as it runs. Monitoring a query in Viewpoint at runtime helps to identify the critical steps.

In Viewpoint you should look for the following steps:

  • Steps that take a long time to finish and consume a huge amount of resources.
  • Steps that are skewed, especially if the skew factor reaches 99%, as this means that all work is done by only one AMP or a few AMPs.

Analyzing the bad query

We have to think about the root cause of the bad query step. Here are a few ideas:

  • Does the base table have a skewed primary index (static skew)?
  • Does the redistribution of rows cause skew (dynamic skew)?
  • Are massive amounts of data redistributed because of poor join preparation due to missing or stale statistics?
  • Are there many hash collisions during the execution of the INSERT statement?

Stale and missing statistics typically lead to incorrect decisions in join preparation (copying a table to all AMPs instead of rehashing) and to the use of incorrect join methods (for example, product join instead of merge join).
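
A quick way to check a base table for static skew is to look at its row distribution across the AMPs (table and primary index column are placeholders):

SELECT
   HASHAMP(HASHBUCKET(HASHROW(Customer_Id))) AS AMP_No, -- the AMP that owns this row hash
   COUNT(*) AS Row_Cnt
FROM Sales
GROUP BY 1
ORDER BY 2 DESC;

If a few AMPs hold a disproportionate share of the rows, the primary index is skewed.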

6. Comparison of Resource Usage

Another important task in Teradata SQL Tuning is measuring resource consumption before and after the optimization of the query.

Query run times are misleading because they can be affected by the simultaneous workload of other queries over which we have no control.

In performance tuning, we prefer to use absolute key figures that are independent of what else is running on our Teradata system.

Below is a query that gives you a detailed insight into the resource usage of the SQL statement to be optimized. To execute this query, you must have read access to the table DBC.DBQLOGTBL (or related views):

Do not forget to give each version of the query to be optimized its own query band so that you can uniquely identify it in the table DBC.DBQLOGTBL:

SET QUERY_BAND = 'Version=1;' FOR SESSION;

SELECT
   AMPCPUTIME,
   (FIRSTRESPTIME-STARTTIME DAY(2) TO SECOND(6)) AS RUNTIME,
   SPOOLUSAGE/1024**3 AS SPOOL_IN_GB,
   CAST(100-((AMPCPUTIME/(HASHAMP()+1))*100/NULLIFZERO(MAXAMPCPUTIME)) AS INTEGER) AS CPU_SKEW,
   MAXAMPCPUTIME*(HASHAMP()+1) AS CPU_IMPACT,
   AMPCPUTIME*1000/NULLIFZERO(TOTALIOCOUNT) AS LHR
FROM
   DBC.DBQLOGTBL
WHERE
   QUERYBAND LIKE '%Version=1;%' -- session query bands are stored with a prefix, so an exact match would fail
;

The above query will return the following measures:

  • The total CPU Usage
  • The Spool Space needed
  • The LHR (ratio between CPU and IO usage)
  • The CPU Skew
  • The Skew Impact on the CPU

Our goal is to decrease the total CPU usage, the consumed spool space, and the CPU skew. The LHR is optimally around 1.00.

See also:
Teradata Golden Tuning Tipps 2017 – Take your Skills to the next Level!
Teradata Tuning – Query Rewriting
Tactical Workload Tuning on Teradata
The Primary Index Choice
Tutorial Point Tuning Ideas

Roland Wenzlofsky

Roland Wenzlofsky is a computer science graduate and Data Warehouse professional who has been working with the Teradata database system for more than 15 years. He is experienced in the fields of banking and telecommunications, with a strong focus on performance optimization.

  • Rémi says:

    From a pedagogical point of view, I would actually rank “statistics” at the end of this list.

    I consistently realize that statistics tend to be over-collected due to staff’s “low confidence” in pretty much all the other points you mention here. The magic of the DIAGNOSTIC often results in shifting the attention away from more fundamental design problems, which statistics are just intended to complete.

    Great post!

  • Timm says:

    Agree with Remi :). Way too much stats all over the place. I would also add proper table partitioning to the golden rules.

  • Roland Wenzlofsky says:

    Yes, partitioning definitely should be part of the golden rules.

    Regarding statistics, I always hesitate to show the DIAGNOSTIC HELPSTATS statement. On the one hand, it is a fast and comfortable way to analyze the statistics landscape; on the other hand, I have seen many people start to blindly add all recommendations after I showed them this possibility.

  • Srihari B says:

    Roland,

    Should we collect stats every time a table gets refreshed (either inserts or updates or both)? Let's say we have some 500 tables in the DWH which get refreshed every day; in this case, do we have to collect stats for all 500 tables every day?

  • Roland Wenzlofsky says:

    This is a question which can't be answered easily. New releases of Teradata have some helpful features like statistics extrapolation and comparison of random-AMP sampling against collected statistics. This may help in reducing recollection intervals.

    There are different approaches which could be taken. The refresh interval definitely depends on how fast the data changes. Usually I sneak some code into central components like the historization process. You could, for example, use the ACTIVITYCOUNT variable in BTEQ to get the number of new rows added to a table. This may be helpful in estimating table growth, and this insight can be used for a timely recollection of statistics without any overhead.

    Statistics collection is always a tradeoff between keeping the system performing and the overhead caused by the collection process.

  • Rutuja says:

    INSERT INTO eitopsdb.GTT_srv_ft_data_grp_nbr_rk

    SELECT BASE.srv_accs_id
    ,BSS.data_grp_nbr
    FROM Eitopsdb.srvft_blstmt_usgvol_fltavg_rk BASE INNER JOIN
    EDWVIEWS.bl_stmnt_sub BSS
    ON
    BASE.srv_accs_id = BSS.srv_accs_id
    -- WHERE bl_cyc_dt BETWEEN DATE -31 -- ensure one month ago
    -- AND DATE -1 -- Yesterday
    QUALIFY 1 = ROW_NUMBER() OVER
    (PARTITION BY BSS.srv_accs_id
    ORDER BY BSS.bl_cyc_dt DESC, BSS.data_grp_nbr DESC)

    Could anybody help in tuning this query??

    It is failing with below error-

    *** Failure 3156 Request aborted by TDWM. Exception criteria exceeded: Spool Usage.
    Statement# 1, Info =0

  • Dhruba Barman says:

    Hi Roland,
    Is it fine to collect stats on all columns recommended by the DIAGNOSTIC command?
    Is there any specific way to know which stats recommendations should be implemented and which ones to skip?
    I generally collect stats on single-column stats recommendations and skip the multi-column stats. But there have been instances where multi-column stats recommendations helped after implementation.

    Regards,
    Dhruba

    • Not all recommended statistics are needed. I follow a similar approach. Especially when it comes to multi-column statistics, we have to consider that their application is limited compared to single-column stats. While the optimizer can use the first part of a statistic on columns (a,b), i.e. the “a”, it will not be able to use it for a query which, for example, has a WHERE condition only on column “b”. Of course, the ideal scenario for the statistic on columns (a,b) is a WHERE condition on both columns.

      It’s similar to statistics on functions, such as on CASE statements. They are only useful for a very specific kind of query which matches exactly this function.
