Get blazing fast Teradata queries with these 6 tricks

Roland Wenzlofsky

October 13, 2019



1. Avoid multiple Joins to the same table

Each additional join causes either a full table scan or an index access. We should therefore avoid joining the same table multiple times. Often we can prevent repeated joins to the same table by using subqueries, volatile tables, or ordered analytical functions.

The following example shows how two joins to the same table can be replaced by a single join to an aggregated subquery (the same result can often also be achieved with ordered analytical functions):

Instead of:

SELECT
	A.SUBSCRIBER_ID,
	B.VALUE AS INCOMING,
	C.VALUE AS OUTGOING
FROM 
	SUBSCRIBERS A
	LEFT JOIN CALLS B ON A.SUBSCRIBER_ID = B.SUBSCRIBER_ID AND B.INOUT_CD = 'IN'
	LEFT JOIN CALLS C ON A.SUBSCRIBER_ID = C.SUBSCRIBER_ID AND C.INOUT_CD = 'OUT'
WHERE
	B.CALL_DATE BETWEEN DATE '2019-01-01' AND DATE '2019-12-31'
	AND C.CALL_DATE BETWEEN DATE '2019-01-01' AND DATE '2019-12-31';

Use:

SELECT
	A.SUBSCRIBER_ID,
	B.INCOMING,
	B.OUTGOING
FROM 
	SUBSCRIBERS A
	LEFT JOIN 
	(
		SELECT 
			SUBSCRIBER_ID,
			MAX(CASE WHEN INOUT_CD = 'IN' THEN VALUE END) AS INCOMING,
			MAX(CASE WHEN INOUT_CD = 'OUT' THEN VALUE END) AS OUTGOING
		FROM
			CALLS
		WHERE
			CALL_DATE BETWEEN DATE '2019-01-01' AND DATE '2019-12-31'
			AND INOUT_CD IN ('IN', 'OUT')
		GROUP BY 1
	) B
	ON A.SUBSCRIBER_ID = B.SUBSCRIBER_ID;

2. Avoid Functions in Equi-Join Conditions

When applied to join columns, most Teradata functions prevent direct row access via the ROWHASH. Functions in join conditions are usually a symptom of a poor (insufficiently normalized) data model, so we should always think about improving the data model first. A typical example of such a join condition:

a.SUBSCRIBER_ID = TRIM(LEADING '0' FROM b.SUBSCRIBER_ID)
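If the data model cannot be fixed, one workaround is to resolve the function once in a preparatory step and join on plain columns afterwards. A minimal sketch, assuming a second subscriber table with zero-padded keys (all table and column names are illustrative):

-- Materialize the cleaned join key once instead of evaluating
-- TRIM() in every join condition:
CREATE VOLATILE TABLE PREPAID_PREP AS
(
	SELECT
		TRIM(LEADING '0' FROM SUBSCRIBER_ID) AS SUBSCRIBER_ID,
		VALUE
	FROM PREPAID_SUBSCRIBERS
) WITH DATA
PRIMARY INDEX (SUBSCRIBER_ID)
ON COMMIT PRESERVE ROWS;

COLLECT STATISTICS COLUMN (SUBSCRIBER_ID) ON PREPAID_PREP;

-- The join can now work directly on the ROWHASH of SUBSCRIBER_ID:
SELECT a.SUBSCRIBER_ID, b.VALUE
FROM SUBSCRIBERS a
INNER JOIN PREPAID_PREP b
ON a.SUBSCRIBER_ID = b.SUBSCRIBER_ID;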

3. Avoid the Usage of UNION

Since UNION does not allow duplicate rows, Teradata must deduplicate the result. Deduplication can only be done in one way: all branches are copied into a common spool, the spool is sorted, and then Teradata can eliminate the duplicates. This alone is much more expensive than a UNION ALL statement. What also has to be considered is that UNION does not allow any of the newer tuning features, such as pushing joins into the individual branches. If you want to read more about this topic, see the following article:

Enhanced Performance Features for UNION ALL

Use UNION ALL instead of UNION whenever duplicates are not a problem (as shown below):

Instead of: 

SELECT   CAST ('FIRST'  AS VARCHAR (15)), PK FROM TheTable
UNION 
SELECT   CAST ('SECOND'  AS VARCHAR (15)), PK FROM TheTable;

Use:

SELECT   CAST ('FIRST'  AS VARCHAR (15)), PK FROM TheTable
UNION ALL
SELECT   CAST ('SECOND'  AS VARCHAR (15)), PK FROM TheTable;

4. Avoid GROUP BY over Expressions

As with joins, aggregations should not be defined on expressions and functions, because before the aggregation can be performed, the rows must be redistributed according to their ROWHASH. The following example shows that you can achieve better performance by grouping not on the concatenation but on the underlying columns.

SELECT COALESCE(SUBSCRIBER_ID, 0) || COALESCE(DESCRIPTION, '') FROM THETABLE
GROUP BY
COALESCE(SUBSCRIBER_ID, 0) || COALESCE(DESCRIPTION, '');

Change to:

SELECT COALESCE(SUBSCRIBER_ID, 0) || COALESCE(DESCRIPTION, '') FROM THETABLE
GROUP BY
COALESCE(SUBSCRIBER_ID, 0), COALESCE(DESCRIPTION, '');

5. Divide large Teradata SQL Queries into smaller ones

You should think about making your program logic modular. The larger an SQL statement becomes, the greater the risk of reaching certain limits (e.g., spool space). Furthermore, with the number of joins, the number of possible execution plans grows exponentially, and with it the risk that the optimizer picks a non-optimal plan. Good places to split an SQL statement are, for example, SET operators (UNION, UNION ALL), subqueries, and joins. Think about which subqueries you can materialize as volatile tables. The advantage of volatile tables is that you can collect statistics on them and do preprocessing (e.g., resolve functions so you don't have to apply them later to join columns).
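A split along a subquery boundary might look like this (table and column names are illustrative):

-- Step 1: materialize the pre-filtered subquery as a volatile table
CREATE VOLATILE TABLE CALLS_2019 AS
(
	SELECT SUBSCRIBER_ID, INOUT_CD, VALUE
	FROM CALLS
	WHERE CALL_DATE BETWEEN DATE '2019-01-01' AND DATE '2019-12-31'
) WITH DATA
PRIMARY INDEX (SUBSCRIBER_ID)
ON COMMIT PRESERVE ROWS;

-- Step 2: give the optimizer statistics on the intermediate result
COLLECT STATISTICS COLUMN (SUBSCRIBER_ID) ON CALLS_2019;

-- Step 3: the subsequent, smaller statements join against the
-- pre-filtered volatile table instead of the large base table
SELECT a.SUBSCRIBER_ID, b.VALUE
FROM SUBSCRIBERS a
LEFT JOIN CALLS_2019 b
ON a.SUBSCRIBER_ID = b.SUBSCRIBER_ID;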

6. Consider the Creation of additional Statistics and Indexes

All further optimizations only make sense if the required statistics are available and up-to-date. First, we want the optimizer to create an optimal plan based on the statistics. Only then should we think about other optimization techniques.
Please note that we do not necessarily mean collected statistics. Often dynamic AMP sampling is sufficient to create a good execution plan.

The fastest way to get an overview of our statistics is to run the command DIAGNOSTIC HELPSTATS ON FOR SESSION.
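A typical session could look like this (the query and table names are illustrative):

DIAGNOSTIC HELPSTATS ON FOR SESSION;

EXPLAIN
SELECT a.SUBSCRIBER_ID, b.VALUE
FROM SUBSCRIBERS a
INNER JOIN CALLS b
ON a.SUBSCRIBER_ID = b.SUBSCRIBER_ID;

The plan output then ends with a list of statistics the optimizer would like to have for this query.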

Each EXPLAIN statement executed in the same session then shows a list of recommended statistics and their confidence levels at the end. We should not adopt all suggestions blindly, but they will at least immediately reveal the gross gaps in your statistics landscape.

It is best to add the statistics step by step and check if and how the execution plan changes.

If the optimizer suggests statistics on columns with many distinct values, consider whether sample statistics are the better choice:

COLLECT STATISTICS USING SAMPLE ON COLUMN aPrettyDistinctColumn;

Sample statistics are cheaper to collect and often sufficient in this case.

To check whether your statistics are outdated, compare the actual number of rows in a table with the estimate shown in the execution plan of a SELECT * on that table.
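One simple way to do this comparison (table name illustrative):

EXPLAIN SELECT * FROM CALLS;  -- note the estimated row count of the retrieve step
SELECT COUNT(*) FROM CALLS;   -- compare against the actual row count

If the estimate and the actual count differ significantly, the statistics should be refreshed.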

If you want detailed insight into the statistics of a table, then the following command is beneficial:

SHOW STATISTICS VALUES ON aTable;

This command provides a lot of information about statistics on table level, column level, skew, etc.
