
Before we get into details, I would like to show you two queries (Teradata 14.10):

SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');

CREATE TABLE TheTable
(
PK INTEGER NOT NULL,
TheCol CHAR(10)
) PRIMARY INDEX (PK);

COLLECT STATISTICS COLUMN (PK) ON TheTable;
COLLECT STATISTICS COLUMN (TheCol) ON TheTable;

In our test scenario, the table “TheTable” contains 200,000 rows. We collected statistics on all columns (indexed and non-indexed).

Without a doubt, both queries deliver the same result. But which query would you prefer from a performance point of view? The answer is not immediately apparent.
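That the two predicates are logically equivalent is easy to verify: no character sorts strictly between '1' and '2'. A quick sketch (plain Python standing in for the SQL predicates, applied to the first character of the column value) confirms it:

```python
# Both predicates applied to SUBSTR(TheCol,1,1): for single characters,
# IN ('1','2') and BETWEEN '1' AND '2' select exactly the same values,
# because '1' and '2' are adjacent in the collation order.
chars = [chr(c) for c in range(32, 127)]  # printable ASCII sample

in_list = {ch for ch in chars if ch in ('1', '2')}
between = {ch for ch in chars if '1' <= ch <= '2'}

print(in_list == between)  # True: identical result sets
```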

We have to ask ourselves the following question: can the Optimizer use the statistics on column "TheCol"?

Let’s analyze the execution plan of the first query:

SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';

3) We do an all-AMPs RETRIEVE step from Indexing.Test2 by way of an all-rows scan with a condition of ("((SUBSTR(TheTable.TheCol ,1 ,1 ))<= '2') AND ((SUBSTR(TheTable.TheCol ,1 ,1 ))>= '1')") into Spool 1 (group_amps), which is built locally on the AMPs. The input table will not be cached in memory, but it is eligible for synchronized scanning. The size of Spool 1 is estimated with no confidence to be 40,000 rows (1,440,000 bytes).

The Optimizer estimates the size of the result set to be 40,000 rows.
The alert reader will have recognized that this is exactly 20% of the table rows. Furthermore, the estimation is made with "no confidence". These two facts are enough to tell us that the Optimizer used heuristics to estimate the number of rows:

It was not able to use the collected statistics. The 20% estimation is the heuristic used for closed ranges.
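The heuristic can be illustrated with a small sketch. This is a simplified model for illustration only, not Teradata internals; the 20% factor for closed ranges and the table cardinality of 200,000 come from the example above:

```python
# Simplified sketch of the Optimizer's fallback heuristic for a
# closed-range predicate when no usable statistics are available.
# The 20% factor matches the estimate in the explain plan above;
# this is an illustrative model, not Teradata's actual code.

CLOSED_RANGE_SELECTIVITY = 0.20  # heuristic for BETWEEN without statistics

def heuristic_estimate(table_rows: int) -> int:
    """Estimated result rows for a closed range with no usable stats."""
    return int(table_rows * CLOSED_RANGE_SELECTIVITY)

print(heuristic_estimate(200_000))  # 40000, matching the explain plan
```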

Let us return now to our second query (statistics and everything else stays unchanged):

SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');

3) We do an all-AMPs RETRIEVE step from Indexing.Test2 by way of an all-rows scan with a condition of ("((SUBSTR(TheTable.TheCol ,1 ,1 ))= '1') OR ((SUBSTR(TheTable.TheCol ,1 ,1 ))= '2')") into Spool 1 (group_amps), which is built locally on the AMPs. The input table will not be cached in memory, but it is eligible for synchronized scanning. The result spool file will not be cached in memory. The size of Spool 1 is estimated with low confidence to be 120,493 rows (4,337,748 bytes).

The explain plan reveals that the Optimizer is using the statistics on column "TheCol". The estimation is of type "low confidence", and the estimated number of rows is close to reality (you have to believe me at this point: the actual number of selected rows is about 122,000).

Conclusion:

If you query closed ranges with only a few values, it is better to use the IN (val1,…,valn) variant. This allows the Optimizer to use the statistics.

Finally, I would like to show you another possibility you have with Teradata 14.10:

Teradata Tuning with Expression Statistics

COLLECT STATISTICS COLUMN (SUBSTR(TheCol,1,1)) AS CHAR1 ON TheTable;

SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';

If statistics are available on the expression, the Optimizer can use them for both queries with “high confidence”.

Still, use expression statistics sparingly. They are only useful for very specific queries, while column statistics have a much broader application (our test scenario can avoid expression statistics entirely and be solved with column statistics and the IN-list syntax).

Summary:

Each of us learns SQL in a different environment and from various educational sources. Many times, we use certain statements only because we are used to them. I hope this article demonstrated that sometimes it pays off to take a look over the fence.


Roland Wenzlofsky

Roland Wenzlofsky is an experienced freelance Teradata Consultant & Performance Trainer. Born in Austria's capital Vienna, he has been building and tuning some of the largest Teradata data warehouses in the European financial and telecommunications sectors for more than 20 years. He has played all the roles of developer, designer, business analyst, and project manager. He therefore knows, like no other, the problems and obstacles that make many data warehouse projects fail, and all the tricks and tips that will help you succeed.
