1. Complete and up-to-date Statistics
At the beginning of Teradata SQL Tuning, we always have to deal with statistics. Teradata Optimizer uses statistics to create our query's best possible execution plan.
Whether we collect statistics or dynamic AMP sampling is sufficient depends on the data demographics.
At the beginning of our optimization work, we must provide all the statistics that the optimizer needs and that they are up to date. The command DIAGNOSTIC HELPSTATS ON FOR SESSION is quite helpful here.
If we run it in a session, the optimizer will show us at the end of each execution plan (which we get by prepending EXPLAIN to our SQL statement) any statistics it would need to create a better plan.
Not all of the suggested statistics are necessary to improve the execution plan. Many suggestions will not change anything except the estimated number of rows.
Therefore, it is advisable to apply these suggestions step by step and check which of them really improve the plan and which only consume additional resources for collecting.
It should be apparent why we put statistics at the beginning of our optimizations. Since every statistic collected can change the execution plan, other optimizations may not be necessary or may even be nullified by collecting statistics.
The optimizer can collect statistics at runtime (dynamic AMP sampling) if no collected statistics are available. These can be sufficient depending on the data demographics, as I will show you later in this article.
For columns with many different values, dynamic AMP sampling is usually sufficient for the optimizer to get a good idea of the data demographics.
The table should contain significantly more rows than there are AMPs in the system.
In general, the more different values a column contains, the better dynamic AMP sampling represents the data demography. The primary key columns of a table are therefore particularly suitable, but also any column defined as UNIQUE.
Dynamic AMP sampling is not suitable for heavily skewed columns. Then the risk is high that the optimizer takes the sample from an AMP that is not representative and overestimates or underestimates the number of rows.
For the reasons mentioned before, good candidates for dynamic AMP sampling are UPI and USI.
The optimizer can use dynamic AMP sampling only on indexed columns. It is crucial to collect statistics for all non-indexed columns used in joins, WHERE conditions, or aggregations.
If the optimizer does not have statistics on a non-indexed column, it must use heuristics to estimate the number of selected rows. Heuristics are nothing more than rules of thumb and usually give completely wrong estimates.
Here is a detailed listing when dynamic AMP sampling is sufficient:
- USIs or UPIs used with equality predicates (such as WHERE conditions)
- NUSIs with an even distribution of column values
- NUPIs with an even distribution of column values
Fully collected Statistics
Fully collected statistics are required for skewed column values and columns with a small number of distinct values.
Here is a detailed list of when statistics should be collected:
- Non-indexed columns used in predicates (such as WHERE conditions)
- All NUSI with an uneven distribution of column values
- USIs and UPIs if used in non-equality predicates or range constraints)
- NUPIs with an uneven distribution of column values
- Join columns
- Small tables with less rows than AMPs
- The dummy column "PARTITION" for all tables
- Partitioning columns of row-partitioned tables
Statistics for Row-Partitioned Tables
The Teradata Optimizer has unique statistic requirements for PPI tables.
The following statistics should additionally be collected on PPI tables:
Dummy Column “PARTITION”
Statistics on dummy column “PARTITION” tell the optimizer the number of the empty statistics.
Dummy column “PARTITION” + PI column
These statistics are required if the partitioning columns are not part of the primary index. In this case, the same primary index value can exist in different partitions. Statistics on dummy column “PARTITION” + PI allow the optimizer to estimate the cost of the sliding window and rowkey based merge join and of dynamic partition elimination steps.
Teradata Multicolumn Statistics
Both statistics on single and multiple columns have advantages and disadvantages. The advantage of single statistics is that they can be used much more flexibly by the Optimizer. The advantage of multicolumn statistics is that they reveal correlations between columns, and the Optimizer can use this information. However, multicolumn statistics are not as universal as statistics on single columns.
The Optimizer can use statistics on (A, B) only for predicates on (A) or (A, B). Not if our predicate is B alone (such as "WHERE column = B").
The Optimizer can use individual statistics on (A) and (B) for predicates on (A), on (B) as well as on (A, B).
Here are the recommendations when you should use multicolumn statistics:
- Columns that are used together with equality predicates.
- Columns are used for joins or aggregations, with some degree of correlation. The more the combination of actual values is related, the more significant the importance of collecting multicolumn statistics is in this situation.
Identify Stale Statistics
There are several ways to identify stale statistics. The easiest way is to split the SQL statement and test each partial statement individually. Splitting is done merely by comparing the estimated number of rows (as shown in the Explain output) with the actual number of records returned by the query.
The above-described approach is particularly suitable if the entire SQL statement does not execute in a reasonable time.
Here's an example:
FROM <Tablename1> t01
ON t01.PK = t02.PK
t01.<Column name> = 1 AND t02.<Column name> = 2;
The above query can be divided into two parts for testing:
SELECT * FROM <Tablename1> WHERE <Columnnname> = 1;
SELECT * FROM <Tablename2> WHERE <Columnnname> = 2;
If we execute both sub-queries, and the number of rows returned differs significantly from the estimate in EXPLAIN, the statistics may be obsolete.
2. Teradata Primary Index Choice
Teradata SQL Tuning is not complete without choosing the best possible Primary Index. Design your queries so the Optimizer can use the primary index in the joins.
All join columns must be part of the primary index. If just one column of the primary index is missing in the join condition, the result is a different row hash (the order how they are defined in the primary index doesn't matter)
Still, the join condition can contain additional columns. After locating the rows via row hash, these are then applied as residual conditions.
If we want to join two tables, the rows of both tables must be co-located on the same AMP. This prerequisite is fulfilled when both tables have the same primary index. In this case, the Optimizer can use a join strategy that requires no redistribution of rows, a so-called "direct join".
If, for some reason, we cannot change the primary indexes, or if we need a specific primary index for a particular SQL statement, we can create a volatile table. It must have the same structure and content as the original table but with a different primary index.
There are three points to consider when selecting the Primary Index: uniform distribution of data, suitable for join operations, and low volatility:
Having an evenly distributed Primary Index is the only way to ensure that all AMPs start and finish their work simultaneously. This is exactly what parallel data processing is all about.
3. Teradata SQL Tuning with Indexing & Partitioning
Using indexes or partitioning is another way to improve query performance.
Teradata offers a number of different indexes, all of which have their advantages and disadvantages. Which index should be used when depends mainly on the workload.
The Teradata Secondary Index
Secondary indexes come in two forms. As unique secondary index (USI) and as non-unique secondary index (NUSI). Although one could conclude from the name that these differ only in the uniqueness, their functionality is very different.
The Teradata USI
The USI is very similar to the Unique Primary Index:
The Index Rows of the Index subtable are distributed evenly over all AMPS by Rowhash and sorted by RowHash.
If the USI contains at least all columns used in a WHERE condition, the AMP can be determined which owns the index row (in this case the index row).
The determined index row contains the ROWID of the base table row that is being searched for, and this can then be used to access the desired row.
The USI is suitable for direct access to individual rows and is therefore high-performance. It is ideal for tactical workload where this feature is required.
The Teradata NUSI
The NUSI is not distributed according to a Rowhash. NUSI rows are always held together with the AMP that has the corresponding row of the base table. Therefore a NUSI access is always an all-AMP operation.
The NUSI index rows are sorted by rowhash or by an integer value. Sorting by an integer value (date is internally of data type integer, and therefore also possible) is one of the advantages of the NUSI, as this makes it ideal for range scans (e.g. all days of a certain month).
The Teradata Row-Partitioned Table
While secondary indexes in Teradata are stored in a sub-table and therefore require extra space, partitioning is just another way to structure the rows on the mass storage device.
The rows of a partitioned table are not only distributed to the AMPs according to rowhash, but are also assigned to the appropriate partitions and inside the partitions sorted according to rowhash.
The rows of a partition are arranged on the disk in such a way that a full cylinder scan can be performed.
When you should partition a table, and when the use of a secondary index or join index is more appropriate, depends on the workload. It is also possible to create additional indexes on a partitioned table.
Partitioned tables are often used for strategic queries in which a series of data (for example, the sales for the current year) is aggregated.
Disadvantages of Indexing & Partitioning
When we work with indexing techniques, we need to keep an eye on the entire data warehouse architecture and decide whether our solution fits. Indexes can have a negative impact on the ETL process for several reasons.
Loading tools such as Fastload require secondary indexes and join indexes to be removed before loading.
The index sub-tables have to be managed by Teradata. Insert, Delete and Update statements require that in addition to the actual table, the index sub-table must be maintained.
If potentially useful indexes are not used by the optimizer and they are not helpful in the entire PDM design, drop them immediately. You're wasting space and resources.
4. Query Rewriting
The performance of a query can often be improved by rewriting the query.
Personally, I prefer to consider an SQL statement as a black box and to limit optimizations to technical methods first.
Here are a few examples:
- EXISTS instead of IN
- Splitting a large SQL statement into smaller parts
- UNION ALL instead of UNION
- DISTINCT instead of GROUP BY
Not having to understand the content and business logic of a query, I do not need to contact the author of the query. The purely technical optimizations are usually not that risky.
Only when I reach the limits of purely technical optimization do I begin to deal with business logic. Understanding business logic often opens up additional tuning opportunities.
Rewriting of queries often solves performance problems, even when all other techniques have failed.
5. Teradata SQL Tuning with Real-Time Monitoring
Teradata SQL tuning requires to watch the query running in real-time. Monitoring a query in viewpoint at runtime helps to identify the critical steps.
In Viewpoint you should look for the following steps:
- Steps that take a long time to finish and consume a huge amount of resources.
- Steps that are skewed, especially if the skew factor reaches 99%, as this means that all work is done by only one AMP or a few AMPs.
Analyzing the bad query
We have to think about the root cause of the bad query step, here are a few ideas:
- Does the base table have a skewed primary index (static skew)?
- Does the redistribution of rows cause skew (dynamic skew)?
- Are massive amounts of data redistributed because of poor join preparation due to missing or stale statistics?
- Are there several hash collisions during the execution of the INSERT statement?
Stale and missing statistics typically lead to incorrect decisions in join preparation (copying a table to all AMPs instead of rehashing) and to the use of incorrect join methods (for example, product join instead of merge join).
6. Comparison of Resource Usage
Another important task in Teradata SQL Tuning is measuring resource consumption before and after the optimization of the query.
Query run times are misleading because they can be affected by the simultaneous workload of other queries over which we have no control.
In performance tuning, we prefer to use absolute key figures that are independent of what else is running on our Teradata system.
Below is a query that gives you a detailed insight into how good each step of the SQL statement to be optimized is. To execute this query, you must have read access to table DBC.DBQLOGTBL (or related views):
Do not forget to give each version of the query to be optimized its own query band so that you can uniquely identify it in the table DBC.DBQLOGTBL:
SET QUERY_BAND = 'Version=1;' FOR SESSION;
(FIRSTRESPTIME-STARTTIME DAY(2) TO SECOND(6)) RUNTIME,
SPOOLUSAGE/1024**3 AS SPOOL_IN_GB,
NULLIFZERO(MAXAMPCPUTIME)) AS INTEGER) AS CPU_SKEW,
MAXAMPCPUTIME*(HASHAMP()+1) AS CPU_IMPACT,
AMPCPUTIME*1000/NULLIFZERO(TOTALIOCOUNT) AS LHR
QUERYBAND = 'Version=1;' ;
The above query will return the following measures:
- The total CPU Usage
- The Spool Space needed
- The LHR (ratio between CPU and IO usage)
- The CPU Skew
- The Skew Impact on the CPU
Our goal is to decrease total CPU usage, consumed spool space and skewing on the CPU. The optimal value for LHR is around 1.00