Improving Teradata Performance with SQL Query Rewriting

Roland Wenzlofsky

April 23, 2023

minutes reading time

Teradata Tuning with Query Rewriting

This article explores how to enhance the efficiency of SQL statements through rewriting techniques. This is based on a recent project scenario.

This SQL statement had a lengthy runtime and required further analysis:

SELECT
t01.PK
,t01.TheDate
,t01.COL1
,t01.COL2
...
,t01.COL59
,t01.COL60
,CASE WHEN t02.PK IS NULL THEN 'NEW' ELSE 'CHANGED' END AS Type_Indicator
FROM
TheTable t01
LEFT JOIN
TheTable t02
ON
t01.PK = t02.PK AND
AND PK2.TheDate <> DATE'9999-12-31'
WHERE
PK1.TheDate = DATE'9999-12-31'
GROUP BY 1,2,3...,59,60;

The table we choose from has the following format:

CREATE TABLE TheTable
(
PK INTEGER NOT NULL,
TheDate DATE FORMAT 'YYYY-MM-DD',
COL1 VARCHAR(10),
..
COL60 VARCHAR(255)
) PRIMARY INDEX(PK);

We are choosing from a table that stores one entry for each item and reference date (with columns “PK” and “TheDate”). The table consists of roughly 150 million rows and numerous columns.

The objective of the query is to retrieve the latest records for each object, identifiable by the date ‘9999-12-31’, and indicate whether prior records exist. To distinguish between objects with and without history, we utilize the “Type_Indicator” column, which classifies them as either “OLD” or “NEW”.

Upon analyzing the SQL statement, it yields the desired result. However, the query requires 2 hours to complete.

Our initial step is to examine each component of the query and pinpoint any flaws in the implementation strategy.

During a self-join, Teradata repeatedly retrieves each recent row whenever it matches a historical row. To eliminate these duplicates, a GROUP BY statement is necessary. Despite the join being AMP-local due to a matching Primary Index, a costly redistribution step is still required for the GROUP BY statement. Additionally, the spool must be sorted by all 60 columns to enable grouping.

The expenses surge as more rows and columns are grouped together.

After identifying the query’s weaknesses, let us explore methods to enhance it.

To enhance the process, we can efficiently detect the presence of at least one historical row without including all historical rows in the current query.

SELECT
t01.PK
,t01.TheDate
,t01.COL1
,t01.COL2
...
,t01.COL59
,t01.COL60
, CASE WHEN t02.PK IS NULL THEN 'NEW' ELSE 'CHANGED' END AS Type_Indicator
FROM
TheTable t01
LEFT JOIN
(
SELECT DISTINCT PK
FROM  TheTable
WHERE TheDate <> DATE'9999-12-31'
) t02
ON
t01.PK = t02.PK AND
WHERE
PK1.TheDate = DATE'9999-12-31';

The revised query optimizes by transferring the comparison logic to a subquery aggregating the primary key column while leaving everything else unaltered.

The new query has a noteworthy advantage:

Utilizing the DISTINCT function and an AMP-local join completes the aggregation process without the need for redistribution, resulting in a more optimized query. This contrasts the previous inefficient query, which necessitated a costly redistribution step and sorting of 60 columns instead of just one.

Effective performance tuning requires measurable indicators of success. The following are recommended:

Disk IOs old: 1,878,357
Disk IOs new: 163,532

We were able to cut disk IOs by almost 91%!

The situation is the same for the necessary spool space.

Spool Space old: approx. 16.3 GB
Spool Space new: approx. 3.3 GB

We decreased spool consumption by around 80%!

Our tuning example demonstrated how minor query modifications can significantly impact resource consumption.

Take a brief survey and receive the book for free Get instant access

Share0

Tweet0

Share0

Roland Wenzlofsky

Roland Wenzlofsky is an experienced freelance Teradata Consultant & Performance Trainer. Born in Austria's capital Vienna, he is building and tuning some of the largest Teradata Data Warehouses in the European financial and telecommunication sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager. Therefore, he knows like no other the problems and obstacles that make many data warehouse projects fail and all the tricks and tips that will help you succeed.

Hi Ronald

I modified your query and removed Col1 … Col60 and the Group by Clause from end for simplicity.
So the updated queries look like :

/*-=======Original Query==========- */
SELECT
t01.col_pk
,t01.col_date
,CASE WHEN t02.col_pk IS NULL THEN ‘NEW’ ELSE ‘CHANGED’ END AS Type_Indicator
FROM
vt_test t01
LEFT JOIN
vt_test t02
ON
t01.col_pk = t02.col_pk
AND t02.col_date DATE ‘9999-12-31’
WHERE
t01.col_date = DATE ‘9999-12-31’;

/*-=======Tuned Query=========- */

SELECT
t01.col_pk
,t01.col_date
,CASE WHEN t02.col_pk IS NULL THEN ‘NEW’ ELSE ‘CHANGED’ END AS Type_Indicator
FROM
vt_test t01
LEFT JOIN
(SEL DISTINCT col_pk FROM vt_test WHERE col_date DATE ‘2999-12-31’
)t02
ON
t01.col_pk = t02.col_pk
WHERE
t01.col_date = DATE ‘9999-12-31’;

Now when I compare the explain plan for both the queries I see the Tuned one actually has 5 steps however in the original query there are just 2 steps. Also, the estimated time was very less for the Original query compared to the Tuned query. If we just do collect stats on the original query won’t it help? Since the Tuned query has 5 steps and involves 4 spool creation where as Original query has just 2 steps and 1 spool is created for end result. My guess is Original query should be better than Tuned.
Also please share more information on below two points:

Disk IOs old: 1,878,357
Disk IOs new: 163,532

Spool Space old: approx. 16.3 GB
Spool Space new: approx. 3.3 GB

How did you calculate it? Is it by getting the values from DBQL and adding IO and SpoolUsage for the entire session? Or is it comparison at step level?

My assumption of removing Col1…Col60 and removing Group by clause might have changed the explain plan for both the queries. Hence I may be wrong. Kindly correct me.

Cheers
Nitin

Hi Roland,

As i see, the main objective of above sql is to set active_indicator = ‘NEW’ for current product (which can be identified by the largest date ‘9999-12-31’) else ‘CHANGED’.

this can be also be done as

SELECT
t01.PK
,t01.TheDate
,t01.COL1
,t01.COL2
…
,t01.COL59
,t01.COL60
,CASE WHEN t01.date = ‘9999-12-31’ THEN ‘NEW’ ELSE ‘CHANGED’ END AS Type_Indicator
FROM
TheTable t01

Why we need join in the first place..?
What if new record is not identified by static value but is different of all products (versioned data)..?

Thanks

Roland Wenzlofsky says:
at
The idea of this query is to check if any historical records exist, but the attribute values of the current one (with data ‘9999-12-31’) should be selected. It’s only checking if history records exist. If yes, then we identify the current records as UPDATE otherwise it’s NEW.
Reply

Improving Teradata Performance with SQL Query Rewriting

Teradata Tuning with Query Rewriting

Roland Wenzlofsky

Fast multi-file export of Teradata query results using only Teradata SQL Assistant

The Teradata AMP Worker Task

Boost Your Teradata Performance – The Critical Role of NOT NULL Declarations

Optimizing Teradata SQL Queries by Avoiding Full Table Scans and Utilizing Secondary Indexes