Teradata Tuning with Query Rewriting
In this article, we will learn how to improve the performance of SQL statements by rewriting them. This is a scenario from one of my recent projects.
The following SQL statement showed a high runtime and therefore had to be analyzed in more detail (the 60-column select list is abbreviated here):

SELECT
    t01.Col1
   ,t01.Col2
   ...
   ,t01.Col60
   ,CASE WHEN t02.PK IS NULL THEN 'NEW' ELSE 'CHANGED' END AS Type_Indicator
FROM TheTable t01
LEFT OUTER JOIN TheTable t02
ON  t01.PK = t02.PK AND
    t02.TheDate <> DATE'9999-12-31'
WHERE t01.TheDate = DATE'9999-12-31'
GROUP BY 1,2,3,...,59,60;
The table from which we select has the following structure:
CREATE TABLE TheTable
(
   PK INTEGER NOT NULL,
   TheDate DATE FORMAT 'YYYY-MM-DD',
   ...
) PRIMARY INDEX (PK);
We are selecting from a table holding one record per object and reference date (columns “PK” and “TheDate”). The table contains about 150 million rows and an extensive number of columns.
The query’s goal is to select the current row of each object (identifiable by the date ‘9999-12-31’) and to show whether at least one historical row exists. We use the column “Type_Indicator” to differentiate between objects with history and objects without history (“CHANGED” or “NEW”).
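To make this concrete, consider a hypothetical miniature of the table (invented PK values and dates; only the two key columns shown). Object 1 has one historical and one current row, object 2 only a current row, so the query should flag object 1 as ‘CHANGED’ and object 2 as ‘NEW’:

```sql
-- Hypothetical sample rows (invented values, for illustration only)
INSERT INTO TheTable (PK, TheDate) VALUES (1, DATE'2021-06-30');  -- historical row of object 1
INSERT INTO TheTable (PK, TheDate) VALUES (1, DATE'9999-12-31');  -- current row of object 1
INSERT INTO TheTable (PK, TheDate) VALUES (2, DATE'9999-12-31');  -- current row of object 2, no history
```

On this data, the result set should contain exactly two rows: the current row of object 1 flagged ‘CHANGED’ and the current row of object 2 flagged ‘NEW’.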
A brief analysis shows that the statement delivers the result described above. The problem is not correctness but runtime: the query takes about 2 hours.
We will first analyze the query in its individual parts to identify the weaknesses in the execution plan.
Because of the self-join, Teradata returns each current row several times (once per matching historical row). A GROUP BY statement is needed to get rid of these duplicates.
Although the join is done AMP-local (as the join column matches the Primary Index), we have a costly redistribution step for the GROUP BY statement. Furthermore, the spool has to be sorted by all 60 columns to be able to do the grouping.
Costs increase rapidly with the number of table rows and the number of columns being grouped.
Now that we know the query’s weak points, let’s analyze how we can improve it.
In the original query, we join all historical rows against each current row just to detect whether at least one historical row exists. We can do this in a better way:

SELECT
    t01.Col1
   ,t01.Col2
   ...
   ,t01.Col60
   ,CASE WHEN t02.PK IS NULL THEN 'NEW' ELSE 'CHANGED' END AS Type_Indicator
FROM TheTable t01
LEFT OUTER JOIN
(
   SELECT DISTINCT PK
   FROM TheTable
   WHERE TheDate <> DATE'9999-12-31'
) t02
ON t01.PK = t02.PK
WHERE t01.TheDate = DATE'9999-12-31';
In the optimized query, we moved the existence check into a derived table that deduplicates only the primary key column (SELECT DISTINCT PK), so the outer query no longer needs a GROUP BY. Everything else stays the same.
The new query has one significant advantage:
Although it still involves an aggregation step (the DISTINCT) and the same AMP-local join, the aggregation is also done AMP-local: because PK is the Primary Index, all rows with the same PK already reside on the same AMP, so no redistribution is needed. The unoptimized query required a costly redistribution step. Furthermore, only one column has to be sorted; in the old query, the AMPs had to sort by 60 columns, which is much more expensive.
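As a side note, the same existence check can also be written with EXISTS/NOT EXISTS predicates in the WHERE clause, which avoids both the GROUP BY and the explicit DISTINCT. This is a sketch for comparison, not the rewrite measured below, and its performance should be verified with EXPLAIN before use:

```sql
-- Current rows of objects that have at least one historical row
SELECT t01.*, 'CHANGED' AS Type_Indicator
FROM TheTable t01
WHERE t01.TheDate = DATE'9999-12-31'
  AND EXISTS (SELECT 1 FROM TheTable t02
              WHERE t02.PK = t01.PK AND t02.TheDate <> DATE'9999-12-31')
UNION ALL
-- Current rows of objects without any history
SELECT t01.*, 'NEW' AS Type_Indicator
FROM TheTable t01
WHERE t01.TheDate = DATE'9999-12-31'
  AND NOT EXISTS (SELECT 1 FROM TheTable t02
                  WHERE t02.PK = t01.PK AND t02.TheDate <> DATE'9999-12-31');
```

The drawback of this variant is that it references the base table twice, so the single-pass derived-table version may still be the cheaper plan.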
Performance tuning should always be accompanied by measurements that prove the success. Here they are:
Disk IOs old: 1,878,357
Disk IOs new: 163,532
We were able to cut disk IOs by about 91%!
The situation is similar for the required spool space:
Spool Space old: approx. 16.3 GB
Spool Space new: approx. 3.3 GB
We reduced spool usage by approx. 80%!
I hope this tuning example showed how slight changes to a query can have an enormous impact on resource usage.