Teradata Tuning – Query Rewriting

3
1136
Teradata Tuning

Teradata Tuning

Teradata Tuning with Query Rewriting

In this article, I will show you an SQL tuning I did in one of my recent projects.  It demonstrates impressively how query rewriting can be used to meet the dramatic reduction in resource usage.

The following SQL statement was identified as being problematic because it had a very longexecution times:

 

SELECT
t01.PK
,t01.TheDate
,t01.COL1
,t01.COL2

,t01.COL59
,t01.COL60
,CASE WHEN t02.PK IS NULL THEN ‘NEW’ ELSE ‘CHANGED’ END AS Type_Indicator
FROM
TheTable t01
LEFT JOIN
TheTable t02
ON
t01.PK = t02.PK AND
AND PK2.TheDate <> DATE’9999-12-31′
WHERE
PK1.TheDate = DATE’9999-12-31′
GROUP BY 1,2,3…,59,60;

This is the related table DDL:

CREATE TABLE TheTable
(
PK INTEGER NOT NULL,
TheDate DATE FORMAT ‘YYYY-MM-DD’,
COL1 VARCHAR(10),
..
COL60 VARCHAR(255)
) PRIMARY INDEX(PK);

First of all, let me give you some details about the query above: We are selecting from a table containing one record per object and efficient date (columns “PK” and “TheDate”). The table has about 150 million rows and a huge number of columns.

The idea behind the query is to select all current rows of each object (which can be identified by the largest date ‘9999-12-31’) and show if there exists at least one historical row (with an older effective date). We use the column “Type_Indicator” to differentiate between objects with history and such objects without history (“OLD” or “NEW”)

If you briefly analyze the SQL statement, you will see that it correctly delivers the above-described result.

But I already mentioned: The query is performing bad, having run times of up to 2 hours.

Let’s dissect the query and find out its weak points:

As we are doing a self-join, we get back each current row several times (namely, each time it matches with a historical row).

If we want to select only one current row, a GROUP BY statement is needed.
Although the join can be done AMP-local (as the join column matches the Primary Index), we have a costly redistribution step for the GROUP BY statement. Furthermore, the spool has to be sorted by all 60 columns to be able to do the grouping.

Obviously, costs are increasing rapidly with the number of table rows and columns being grouped.

Now that we know the query’s weak points let’s analyze how we could improve it.

Actually, in the original query, we are joining all historical rows against the current row only to detect if there exists at least one historical row. We can do this in a better way:

SELECT
t01.PK
,t01.TheDate
,t01.COL1
,t01.COL2

,t01.COL59
,t01.COL60
, CASE WHEN t02.PK IS NULL THEN ‘NEW’ ELSE ‘CHANGED’ END AS Type_Indicator
FROM
TheTable t01
LEFT JOIN
(
SELECT DISTINCT PK
FROM  TheTable
WHERE TheDate <> DATE’9999-12-31′
) t02
ON
t01.PK = t02.PK AND
WHERE
PK1.TheDate = DATE’9999-12-31′;

In above query, I was moving the compare logic into a subquery, which is aggregating only the primary key column. Everything else stays the same.

There is one significant advantage the new query has: 

Although it still involves an aggregation step (DISTINCT) and the same AMP-local join step, the aggregation is done AMP-local (no redistribution is needed). Our old query required a costly redistribution step.

Furthermore, only one column has to be sorted. In our old query, the AMPs had to sort by 60 columns, which is much more expensive.

Performance tuning should always be accomplished by some measures, to prove the success. Here they are:

Disk IOs old: 1,878,357
Disk IOs new:  163,532

We were able to cut disk IOs by almost 91%!

Similarly, the situation is for the required spool space:

Spool Space old:  approx. 16.3 GB
Spool Space new: approx. 3.3 GB

We reduced spool usage by approx. 80%!

I hope that our tuning example showed impressively how slight changes in a query often could have an enormous impact on resource usage.

 

Our Reader Score
[Total: 9    Average: 3.2/5]
Teradata Tuning – Query Rewriting written by Roland Wenzlofsky average rating 3.2/5 - 9 user ratings

3 COMMENTS

  1. Hi Ronald

    I modified your query and removed Col1 … Col60 and the Group by Clause from end for simplicity.
    So the updated queries look like :

    /*-=======Original Query==========- */
    SELECT
    t01.col_pk
    ,t01.col_date
    ,CASE WHEN t02.col_pk IS NULL THEN ‘NEW’ ELSE ‘CHANGED’ END AS Type_Indicator
    FROM
    vt_test t01
    LEFT JOIN
    vt_test t02
    ON
    t01.col_pk = t02.col_pk
    AND t02.col_date DATE ‘9999-12-31’
    WHERE
    t01.col_date = DATE ‘9999-12-31’;

    /*-=======Tuned Query=========- */

    SELECT
    t01.col_pk
    ,t01.col_date
    ,CASE WHEN t02.col_pk IS NULL THEN ‘NEW’ ELSE ‘CHANGED’ END AS Type_Indicator
    FROM
    vt_test t01
    LEFT JOIN
    (SEL DISTINCT col_pk FROM vt_test WHERE col_date DATE ‘2999-12-31’
    )t02
    ON
    t01.col_pk = t02.col_pk
    WHERE
    t01.col_date = DATE ‘9999-12-31’;

    Now when I compare the explain plan for both the queries I see the Tuned one actually has 5 steps however in original query there are just 2 steps. Also the estimate time was very less for Original query compared to Tuned query. If we just do collect stats on original query won’t it help ? Since Tuned query has 5 steps and involves 4 spool creation where as Original query has just 2 steps and 1 spool is created for end result. My guess is Original query should be better than Tuned.
    Also please share more information on below two points:

    Disk IOs old: 1,878,357
    Disk IOs new: 163,532

    Spool Space old: approx. 16.3 GB
    Spool Space new: approx. 3.3 GB

    How did you calculate it ? Is it by getting the values from DBQL and adding IO and SpoolUsage for entire session ? Or is it comparison at step level ?

    My assumption of removing Col1…Col60 and removing Group by clause might have changed the explain plan for both the queries. Hence I may be wrong. Kindly correct me.

    Cheers
    Nitin

  2. Hi Ronald,

    As i see, the main objective of above sql is to set active_indicator = ‘NEW’ for current product (which can be identified by the largest date ‘9999-12-31’) else ‘CHANGED’.

    this can be also be done as

    SELECT
    t01.PK
    ,t01.TheDate
    ,t01.COL1
    ,t01.COL2

    ,t01.COL59
    ,t01.COL60
    ,CASE WHEN t01.date = ‘9999-12-31’ THEN ‘NEW’ ELSE ‘CHANGED’ END AS Type_Indicator
    FROM
    TheTable t01

    Why we need join in the first place..?
    What if new record is not identified by static value but is different of all products (versioned data)..?

    Thanks

    • The idea of this query is to check if any historical rrecords exists, but the attribute values of the current one (with data ‘9999-12-31’) should be selected. It’s only checking if history records exists. If yes, then we identify the current records as UPDATE otherwise it’s NEW.

LEAVE A REPLY

Please enter your comment!
Please enter your name here