Before we get into details, I would like to show you two queries (Teradata 14.10):
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');
CREATE TABLE TheTable
(
PK INTEGER NOT NULL,
TheCol CHAR(10)
) PRIMARY INDEX (PK);
COLLECT STATISTICS COLUMN (PK) ON TheTable;
COLLECT STATISTICS COLUMN (TheCol) ON TheTable;
In our test scenario, the table “TheTable” contains 200,000 rows. We collected statistics on all columns (indexed and non-indexed).
Without a doubt, both queries deliver the same result. Which query would you prefer from a performance point of view? The answer is not immediately apparent.
We have to ask ourselves the following question: Can the Optimizer use the statistics on column “TheCol”?
Let’s analyze the execution plan of the first query:
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';
3) We do an all-AMPs RETRIEVE step from Indexing.Test2 by way of an all-rows scan with a condition of ("((SUBSTR(TheTable.TheCol ,1 ,1 ))<= '2') AND ((SUBSTR(TheTable.TheCol ,1 ,1 ))>= '1')") into Spool 1 (group_amps), which is built locally on the AMPs.
The input table will not be cached in memory but is eligible for synchronized scanning.
The size of Spool 1 is estimated with no confidence to be 40,000 rows (1,440,000 bytes).
The Optimizer estimates the size of the result set to be 40,000 rows.
The alert reader will have recognized that this is exactly 20% of the table rows. Furthermore, the estimation is with “no confidence”. These two facts are enough to know that the Optimizer used heuristics to estimate the number of rows:
It was not able to use the collected statistics. The 20% estimation is the heuristic used for estimating closed ranges.
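The arithmetic behind the heuristic is easy to verify. The fixed 20% selectivity below is an assumption inferred from the estimate in the explain plan above; this sketch only illustrates the calculation and is in no way Teradata's actual costing code:

```python
# Illustration of the closed-range heuristic (inferred, not Teradata code):
# without usable statistics, a closed range such as BETWEEN '1' AND '2'
# is assumed to select a fixed 20% of the table's rows.
table_rows = 200_000
closed_range_selectivity = 0.20  # assumed heuristic selectivity

estimated_rows = int(table_rows * closed_range_selectivity)
print(estimated_rows)  # 40000 -- the estimate shown in the explain plan
```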
Let us return now to our second query (statistics and everything else stays unchanged):
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');
3) We do an all-AMPs RETRIEVE step from Indexing.Test2 by way of an all-rows scan with a condition of ("((SUBSTR(TheTable.TheCol ,1 ,1 ))= '1') OR ((SUBSTR(TheTable.TheCol ,1 ,1 ))= '2')") into Spool 1 (group_amps), which is built locally on the AMPs.
The input table will not be cached in memory but is eligible for synchronized scanning. The result spool file will not be
cached in memory. The size of Spool 1 is estimated with low confidence to be 120,493 rows (4,337,748 bytes).
The explain plan reveals that the Optimizer uses the statistics on column “TheCol”. The estimation is of type “low confidence”, and the estimated number of rows is close to reality (you have to believe me, the actual number of selected rows is about 122,000).
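To put the “low confidence” estimate in perspective, its deviation from the actual row count is small. A quick check, using the roughly 122,000 actual rows stated above:

```python
# Compare the statistics-based estimate from the explain plan with the
# actual row count reported in the article (approximately 122,000 rows).
estimated = 120_493
actual = 122_000  # approximate figure given in the text

relative_error = abs(estimated - actual) / actual
print(f"{relative_error:.1%}")  # about 1.2%
```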
Conclusion:
If you query closed ranges covering only a few values, prefer the IN (val1,…,valn) variant. This allows the Optimizer to use the statistics.
Finally, I would like to show you another possibility you have with Teradata 14.10:
Teradata Tuning with Expression Statistics
COLLECT STATISTICS COLUMN (SUBSTR(TheCol,1,1)) AS CHAR1 ON TheTable;
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) IN ('1','2');
SELECT * FROM TheTable WHERE SUBSTR(TheCol,1,1) BETWEEN '1' AND '2';
If statistics are available on the expression, the Optimizer can use them for both queries with “high confidence”.
Still, use expression statistics sparingly. They are only helpful for specific queries, while column statistics have a much broader application (our test scenario can be solved without expression statistics, using column statistics and the IN-list syntax).
Summary:
We learn SQL in different environments and from various educational sources. We often use certain statements only because we are used to them. I hope this article demonstrated that sometimes it pays off to take a look over the fence.
“The Optimizer estimates the size of the result set to be 40,000 rows.
The alert reader will have recognized that this is exactly 20% of the table rows. ”
You said that the table has only 100,000 rows, so how come it's 20%?
On what basis are you saying that the second execution plan is using statistics on “TheCol”? Both execution plans are almost the same.
I just saw that this was wrong. The table contains 200,000 rows. I fixed the article. Thanks.
Regarding the first and second execution plan: As the Optimizer estimates a fixed value of 20%, the confidence level is “no confidence”. The second plan is estimated with “low confidence” which is a hint that the statistics are used.