Teradata Tuning Tips to make your Queries run faster

Before we get into the details, I would like to show you two queries (Teradata 14.10):

SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';

CREATE TABLE TheTable
(
PK INTEGER NOT NULL,
TheCol CHAR(10)
) PRIMARY INDEX (PK);

COLLECT STATISTICS COLUMN (PK) ON TheTable;
COLLECT STATISTICS COLUMN (TheCol) ON TheTable;

In our test scenario, the table “TheTable” contains 200,000 rows. We collected statistics on all columns (indexed and non-indexed).

Without doubt, both queries deliver the same result. But which query would you prefer from a performance point of view? The answer is not immediately obvious.
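That the two predicates are logically equivalent is easy to verify outside of Teradata. The following sketch uses Python's built-in sqlite3 module (not Teradata, so it says nothing about the optimizer, only about the result sets) with made-up sample data:

```python
import sqlite3

# Build a small stand-in for TheTable whose first character is '0'..'4'.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE TheTable (PK INTEGER NOT NULL, TheCol CHAR(10))")
con.executemany(
    "INSERT INTO TheTable VALUES (?, ?)",
    [(i, str(i % 5) + "x" * 9) for i in range(1000)],
)

in_rows = con.execute(
    "SELECT PK FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2') ORDER BY PK"
).fetchall()
between_rows = con.execute(
    "SELECT PK FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2' ORDER BY PK"
).fetchall()

# Both predicates select exactly the same rows.
assert in_rows == between_rows
print(len(in_rows))  # 400 (the rows where PK % 5 is 1 or 2)
```

Note that the equivalence holds here only because SUBSTR(TheCol,1,1) is always a single character; for longer strings, BETWEEN '1' AND '2' would also match values such as '15'.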

We have to ask ourselves the following question: can the Optimizer use the statistics on column “TheCol”?

Let’s analyse the execution plan of the BETWEEN query:

SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN ‘1’ AND ‘2’;
3) We do an all-AMPs RETRIEVE step from Indexing.Test2 by way of an all-rows scan with a condition of ("((SUBSTR(TheTable.TheCol ,1 ,1 ))<= '2') AND ((SUBSTR(TheTable.TheCol ,1 ,1 ))>= '1')") into Spool 1 (group_amps), which is built locally on the AMPs.
The input table will not be cached in memory, but it is eligible for synchronized scanning.
The size of Spool 1 is estimated with no confidence to be 40,000 rows (1,440,000 bytes).

The Optimizer estimates the size of the result set to be 40,000 rows.
The alert reader will have recognized that this is exactly 20% of the table rows. Furthermore, the estimation is made with “no confidence”. These two facts are enough to tell us that the Optimizer used heuristics to estimate the number of rows:

It was not able to use the collected statistics. The fixed 20% estimate is the heuristic used for closed ranges.
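The 40,000-row estimate can be reproduced with simple arithmetic. The sketch below (plain Python; the 20% figure is taken from the explain output above) shows where it comes from:

```python
# Heuristic the Optimizer falls back to when it cannot use statistics:
# a closed range (BETWEEN) on an unknown expression is estimated at a
# fixed 20% of the table's cardinality.
CLOSED_RANGE_SELECTIVITY = 0.20

table_rows = 200_000
estimate = int(table_rows * CLOSED_RANGE_SELECTIVITY)
print(estimate)  # 40000 -- matches "estimated ... to be 40,000 rows"
```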

Let us now turn to the IN query (statistics and everything else stay unchanged):

SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');

3) We do an all-AMPs RETRIEVE step from Indexing.Test2 by way of an all-rows scan with a condition of ("((SUBSTR(TheTable.TheCol ,1 ,1 ))= '1') OR ((SUBSTR(TheTable.TheCol ,1 ,1 ))= '2')") into Spool 1 (group_amps), which is built locally on the AMPs.
The input table will not be cached in memory, but it is eligible for synchronized scanning. The result spool file will not be
cached in memory. The size of Spool 1 is estimated with low confidence to be 120,493 rows (4,337,748 bytes).

The explain plan reveals that the Optimizer is using the statistics on column “TheCol”. The estimate is of type “low confidence”, and the estimated number of rows is close to reality (you will have to take my word on this point: the actual number of selected rows is about 122,000).
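To put the two estimates into perspective, here is a small Python sketch comparing their relative error against the roughly 122,000 rows actually selected (all row counts are the ones quoted in this article):

```python
actual = 122_000              # approximate real result size
heuristic_estimate = 40_000   # fixed 20% "no confidence" guess
stats_estimate = 120_493      # "low confidence" estimate using column statistics

def relative_error(estimate: int, actual: int) -> float:
    """Absolute estimation error as a fraction of the true row count."""
    return abs(estimate - actual) / actual

print(f"{relative_error(heuristic_estimate, actual):.0%}")  # 67%
print(f"{relative_error(stats_estimate, actual):.0%}")      # 1%
```

The statistics-based estimate is off by about 1%, while the heuristic misses by two thirds; wrong row estimates of this magnitude easily lead to bad join plans further up in the query.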

Conclusion:

If you query closed ranges with a few values, it is better to use the IN (val1,…,valn) variant. This allows the Optimizer to use the statistics.

Finally, I would like to show you another possibility you have with Teradata 14.10:

Teradata Tuning with Expression Statistics

COLLECT STATISTICS COLUMN (SUBSTR(TheCol,1,1)) AS CHAR1 ON TheTable;

SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';

If statistics are available on the expression, the Optimizer can use them for both queries with “high confidence”.

Still, use expression statistics sparingly. They are only useful for very specific queries, while column statistics have a much broader scope of application (our test scenario can be solved without expression statistics, using column statistics and the IN-list syntax).

Summary:

Each of us learns SQL in a different environment and from different educational sources. Many times, we use certain statements only because we are used to them. I hope this article demonstrated that it sometimes pays off to take a look over the fence.

Written by Roland Wenzlofsky on June 27, 2015.
