Before we get into details, I would like to show you two queries (Teradata 14.10):
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');
CREATE TABLE TheTable
(
PK INTEGER NOT NULL,
TheCol CHAR(10)
) PRIMARY INDEX (PK);
COLLECT STATISTICS COLUMN (PK) ON TheTable;
COLLECT STATISTICS COLUMN (TheCol) ON TheTable;
In our test scenario, the table “TheTable” contains 200,000 rows. We collected statistics on all columns (indexed and non-indexed).
Without a doubt, both queries deliver the same result. Which query would you prefer from a performance point of view? The answer is not immediately apparent.
We have to ask ourselves the following question: Can the Optimizer use the statistics on column “TheCol”?
Let’s analyze the execution plan of the first query:
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';
3) We do an all-AMPs RETRIEVE step from Indexing.Test2 by way of an all-rows scan with a condition of ("((SUBSTR(TheTable.TheCol ,1 ,1 ))<= '2') AND ((SUBSTR(TheTable.TheCol ,1 ,1 ))>= '1')") into Spool 1 (group_amps), which is built locally on the AMPs.
The input table will not be cached in memory but is eligible for synchronized scanning.
The size of Spool 1 is estimated with no confidence to be 40,000 rows (1,440,000 bytes).
The Optimizer estimates the size of the result set to be 40,000 rows.
The alert reader will have recognized that this is exactly 20% of the table rows. Furthermore, the estimation is with “no confidence”. These two facts are enough to know that the Optimizer used heuristics to estimate the number of rows:
It was not able to use the collected statistics. The 20% estimation is the heuristic used for estimating closed ranges.
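The arithmetic behind the heuristic is easy to verify. The fixed 20% selectivity below is an assumption inferred from the estimate in the explain plan above; this sketch only illustrates the calculation and is in no way Teradata's actual costing code:

```python
# Illustration of the closed-range heuristic (inferred, not Teradata code):
# without usable statistics, a closed range such as BETWEEN '1' AND '2'
# is assumed to select a fixed 20% of the table's rows.
table_rows = 200_000
closed_range_selectivity = 0.20  # assumed heuristic selectivity

estimated_rows = int(table_rows * closed_range_selectivity)
print(estimated_rows)  # 40000 -- the estimate shown in the explain plan
```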
Let us return now to our second query (statistics and everything else stays unchanged):
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');
3) We do an all-AMPs RETRIEVE step from Indexing.Test2 by way of an all-rows scan with a condition of ("((SUBSTR(TheTable.TheCol ,1 ,1 ))= '1') OR ((SUBSTR(TheTable.TheCol ,1 ,1 ))= '2')") into Spool 1 (group_amps), which is built locally on the AMPs.
The input table will not be cached in memory but is eligible for synchronized scanning. The result spool file will not be
cached in memory. The size of Spool 1 is estimated with low confidence to be 120,493 rows (4,337,748 bytes).
The explain plan reveals that the Optimizer uses the statistics on column “TheCol”. The estimation is of type “low confidence”, and the estimated number of rows is close to reality (you have to believe me, the actual number of selected rows is about 122,000).
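To put the “low confidence” estimate in perspective, its deviation from the actual row count is small. A quick check, using the roughly 122,000 actual rows stated above:

```python
# Compare the statistics-based estimate from the explain plan with the
# actual row count reported in the article (approximately 122,000 rows).
estimated = 120_493
actual = 122_000  # approximate figure given in the text

relative_error = abs(estimated - actual) / actual
print(f"{relative_error:.1%}")  # about 1.2%
```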
Conclusion:
If you query closed ranges covering only a few values, prefer the IN (val1,…,valn) variant. This allows the Optimizer to use the statistics.
Finally, I would like to show you another possibility you have with Teradata 14.10:
Teradata Tuning with Expression Statistics
COLLECT STATISTICS COLUMN (SUBSTR(TheCol,1,1)) AS CHAR1 ON TheTable;
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';
If statistics are available on the expression, the Optimizer can use them for both queries with “high confidence”.
Still, use expression statistics sparingly. They are only helpful for specific queries, while column statistics have a much broader application (our test scenario can be solved without expression statistics, using column statistics and the IN-list syntax).
Summary:
We learn SQL in different environments and from various educational sources. We often use certain statements only because we are used to them. I hope this article demonstrated that sometimes it pays off to take a look over the fence.
“The Optimizer estimates the size of the result set to be 40,000 rows.
The alert reader will have recognized that this is exactly 20% of the table rows. ”
You said that the table has only 100,000 rows, so how come it's 20%?
On what basis are you saying that the second execution plan is using statistics on “TheCol”? Both execution plans are almost the same.
I just saw that this was wrong. The table contains 200,000 rows. I fixed the article. Thanks.
Regarding the first and second execution plan: As the Optimizer estimates a fixed value of 20%, the confidence level is “no confidence”. The second plan is estimated with “low confidence” which is a hint that the statistics are used.