Teradata Tuning Tips to make your Queries run faster

Before we get into the details, I would like to show you two queries (Teradata 14.10):

SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';

CREATE TABLE TheTable
(
PK INTEGER NOT NULL,
TheCol CHAR(10)
) PRIMARY INDEX (PK);

COLLECT STATISTICS COLUMN (PK) ON TheTable;
COLLECT STATISTICS COLUMN (TheCol) ON TheTable;

In our test scenario, the table “TheTable” contains 200,000 rows. We collected statistics on all columns (indexed and non-indexed).

Without doubt, both queries deliver the same result. But which query would you prefer from a performance point of view? The answer is not immediately obvious.
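That the two predicates are logically equivalent is easy to verify outside of Teradata. The following sketch uses Python's built-in sqlite3 module (not Teradata, so it says nothing about the optimizer, only about the result sets) with made-up sample data:

```python
import sqlite3

# Build a small stand-in for TheTable whose first character is '0'..'4'.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TheTable (PK INTEGER NOT NULL, TheCol CHAR(10))")
con.executemany(
    "INSERT INTO TheTable VALUES (?, ?)",
    [(i, str(i % 5) + "x" * 9) for i in range(1000)],
)

in_rows = con.execute(
    "SELECT PK FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2') ORDER BY PK"
).fetchall()
between_rows = con.execute(
    "SELECT PK FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2' ORDER BY PK"
).fetchall()

# Both predicates select exactly the same rows.
assert in_rows == between_rows
print(len(in_rows))  # 400 (the rows where PK % 5 is 1 or 2)
```

Note that the equivalence holds here only because SUBSTR(TheCol,1,1) is always a single character; for longer strings, BETWEEN '1' AND '2' would also match values such as '15'.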

We have to ask ourselves the following question: can the Optimizer use the statistics on column “TheCol”?

Let’s analyse the execution plan of the BETWEEN query:

SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN ‘1’ AND ‘2’;
3) We do an all-AMPs RETRIEVE step from Indexing.Test2 by way of an all-rows scan with a condition of ("((SUBSTR(TheTable.TheCol ,1 ,1 ))<= '2') AND ((SUBSTR(TheTable.TheCol ,1 ,1 ))>= '1')") into Spool 1 (group_amps), which is built locally on the AMPs.
The input table will not be cached in memory, but it is eligible for synchronized scanning.
The size of Spool 1 is estimated with no confidence to be 40,000 rows (1,440,000 bytes).

The Optimizer estimates the size of the result set to be 40,000 rows.
The alert reader will have recognized that this is exactly 20% of the table rows. Furthermore, the estimation is made with “no confidence”. These two facts are enough to tell us that the Optimizer used heuristics to estimate the number of rows:

It was not able to use the collected statistics. The fixed 20% estimate is the heuristic used for closed ranges.
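The 40,000-row estimate can be reproduced with simple arithmetic. The sketch below (plain Python; the 20% figure is taken from the explain output above) shows where it comes from:

```python
# Heuristic the Optimizer falls back to when it cannot use statistics:
# a closed range (BETWEEN) on an unknown expression is estimated at a
# fixed 20% of the table's cardinality.
CLOSED_RANGE_SELECTIVITY = 0.20

table_rows = 200_000
estimate = int(table_rows * CLOSED_RANGE_SELECTIVITY)
print(estimate)  # 40000 -- matches "estimated ... to be 40,000 rows"
```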

Let us now turn to the IN query (statistics and everything else stay unchanged):

SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');

3) We do an all-AMPs RETRIEVE step from Indexing.Test2 by way of an all-rows scan with a condition of ("((SUBSTR(TheTable.TheCol ,1 ,1 ))= '1') OR ((SUBSTR(TheTable.TheCol ,1 ,1 ))= '2')") into Spool 1 (group_amps), which is built locally on the AMPs.
The input table will not be cached in memory, but it is eligible for synchronized scanning. The result spool file will not be
cached in memory. The size of Spool 1 is estimated with low confidence to be 120,493 rows (4,337,748 bytes).

The explain plan reveals that the Optimizer is using the statistics on column “TheCol”. The estimate is of type “low confidence”, and the estimated number of rows is close to reality (you will have to take my word on this point: the actual number of selected rows is about 122,000).
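To put the two estimates into perspective, here is a small Python sketch comparing their relative error against the roughly 122,000 rows actually selected (all row counts are the ones quoted in this article):

```python
actual = 122_000              # approximate real result size
heuristic_estimate = 40_000   # fixed 20% "no confidence" guess
stats_estimate = 120_493      # "low confidence" estimate using column statistics

def relative_error(estimate: int, actual: int) -> float:
    """Absolute estimation error as a fraction of the true row count."""
    return abs(estimate - actual) / actual

print(f"{relative_error(heuristic_estimate, actual):.0%}")  # 67%
print(f"{relative_error(stats_estimate, actual):.0%}")      # 1%
```

The statistics-based estimate is off by about 1%, while the heuristic misses by two thirds; wrong row estimates of this magnitude easily lead to bad join plans further up in the query.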

Conclusion:

If you query closed ranges with a few values, it is better to use the IN (val1,…,valn) variant. This allows the Optimizer to use the statistics.

Finally, I would like to show you another possibility you have with Teradata 14.10:

Teradata Tuning with Expression Statistics

COLLECT STATISTICS COLUMN (SUBSTR(TheCol,1,1)) AS CHAR1 ON TheTable;

SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';

If statistics are available on the expression, the Optimizer can use them for both queries with “high confidence”.

Still, use expression statistics sparingly. They are only useful for very specific queries, while column statistics have a much broader scope of application (our test scenario can be solved without expression statistics, using column statistics and the IN-list syntax).

Summary:

Each of us learns SQL in a different environment and from different educational sources. Many times, we use certain statements only because we are used to them. I hope this article demonstrated that it sometimes pays off to take a look over the fence.

Written by Roland Wenzlofsky on June 27, 2015.
