We have to be careful when applying SQL functions to WHERE condition columns. They often prevent the Optimizer from using available statistics. Without usable statistics, estimations are based on random-AMP sampling (indexed columns) or heuristics (non-indexed columns). The downside of heuristics is described in this article: Teradata Heuristics.
A flawed physical data model is almost always the reason functions end up being applied to WHERE condition columns. We live in a time of cost minimization, but keep the famous saying “bought cheaply will be paid dearly” in mind.
In the example below, basic normalization rules have been violated:
The column CustomerType contains two kinds of information, the customer class (e.g., “Gold”) and the sector (e.g., “Private”), and would better be split into two columns:
COLLECT STATISTICS ON Customer COLUMN(CustomerType);
SELECT * FROM Customer WHERE SUBSTR(CustomerType,1,4) || SUBSTR(CustomerType,10,4) = 'GOLDPRIV';
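To make the example self-contained, here is a minimal sketch of what such a table could look like. The DDL and the fixed-width layout of CustomerType (customer class in positions 1-4, sector starting at position 10) are assumptions derived from the SUBSTR offsets in the query, not taken from a real system:

-- Hypothetical table; layout assumed from the SUBSTR offsets used above.
CREATE TABLE Customer
(
   CustomerId   INTEGER NOT NULL,
   CustomerType CHAR(13)  -- e.g. 'GOLD00001PRIV': class (1-4), filler (5-9), sector (10-13)
) PRIMARY INDEX (CustomerId);

-- Sample rows illustrating the overloaded column:
INSERT INTO Customer VALUES (1, 'GOLD00001PRIV');
INSERT INTO Customer VALUES (2, 'SILV00002CORP');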
The Optimizer can’t use the collected statistics on the column CustomerType. It has to apply simple heuristic estimations.
If we assume that our Customer table contains 100,000 rows and we store 100 “Gold” customers in the “Private” sector, the resulting execution plan will look like this:
“We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with a condition of
("((SUBSTR(Customer.CustomerType,1 ,4 ))||(SUBSTR(Customer.CustomerType,10 ,4 )))= 'GOLDPRIV'")
into Spool 1, which is built locally on the AMPs.
The size of Spool 1 is estimated with no confidence to be 10,000 rows.”
The Optimizer assumes that 10% of the table rows will be selected (heuristics). This estimation is entirely wrong; in reality, we are selecting only 100 rows!
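One way to see this gap yourself is to compare the estimate in the EXPLAIN output with an actual row count (a quick check, using the example table and predicate from above):

-- The plan shows the heuristic estimate ("no confidence", 10,000 rows):
EXPLAIN SELECT * FROM Customer
WHERE SUBSTR(CustomerType,1,4) || SUBSTR(CustomerType,10,4) = 'GOLDPRIV';

-- The real number of qualifying rows (100 in our example):
SELECT COUNT(*) FROM Customer
WHERE SUBSTR(CustomerType,1,4) || SUBSTR(CustomerType,10,4) = 'GOLDPRIV';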
Our example is straightforward. The execution plan contains only one retrieve step. The WHERE condition column is not indexed, and the estimation doesn’t impact the execution plan. Assuming that no alternative access paths (join indexes, secondary indexes) are available, the plan will always be the same: a full table scan.
The execution plans of queries are usually much more complex than in our simple example, and the above retrieve step would be just one step out of many.
Unlike in our simple example, an entirely wrong estimation in one of many steps can then propagate and lead to a poorly performing query.
How can we improve our example query? We have to improve the estimations for the retrieving step.
Below is an example of how the query could be rewritten. The optimized query allows the Optimizer to use the collected statistics on column CustomerType, and as a consequence, the retrieving step evaluates to “low confidence.” The estimated number of selected rows reflects reality:
SELECT * FROM Customer WHERE SUBSTR(CustomerType,1,4) = 'GOLD' AND SUBSTR(CustomerType,10,4) = 'PRIV';
“We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with a condition of ("((SUBSTR(Customer.CustomerType,1 ,4 ))= 'GOLD')
AND ((SUBSTR(Customer.CustomerType,10 ,4 ))= 'PRIV')")
into Spool 1, which is built locally on the AMPs.
The size of Spool 1 is estimated with low confidence to be 100 rows.”
There is a basic rule you should follow when designing your queries using complex WHERE condition expressions:
If functions on a WHERE condition column are nested, the SUBSTR or LIKE operation should always be applied first, i.e., directly to the column.
The reason: the Optimizer can use statistics on a column even when only part of a character string is queried; the way the statistics histograms store their information makes this possible.
Unfortunately, statistics can’t be used if a TRIM function is applied first (this behavior is not fully consistent; there are exceptions, such as the functions UPPER and LOWER).
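Here is a small sketch of the rule, reusing the example table from above; whether statistics are actually usable in each variant can depend on the Teradata release, so treat it as an illustration of the nesting order, not a guarantee:

-- SUBSTR applied directly to the column: collected statistics can be used.
SELECT * FROM Customer WHERE TRIM(SUBSTR(CustomerType,1,4)) = 'GOLD';

-- TRIM applied to the column first: statistics are typically not usable.
SELECT * FROM Customer WHERE SUBSTR(TRIM(CustomerType),1,4) = 'GOLD';

-- LIKE on the leading part of the column can also make use of the histograms.
SELECT * FROM Customer WHERE CustomerType LIKE 'GOLD%';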
The best advice I can give: Before ending up fixing bad queries, try to fix your data model!
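To make that advice concrete, here is one possible normalized design; the table and column names below are illustrative assumptions, not part of the original example:

-- One attribute per column, each with its own statistics.
CREATE TABLE Customer_New
(
   CustomerId    INTEGER NOT NULL,
   CustomerClass CHAR(4),  -- e.g. 'GOLD'
   Sector        CHAR(4)   -- e.g. 'PRIV'
) PRIMARY INDEX (CustomerId);

COLLECT STATISTICS ON Customer_New COLUMN(CustomerClass);
COLLECT STATISTICS ON Customer_New COLUMN(Sector);

-- No functions in the WHERE condition; the Optimizer can use both statistics.
SELECT * FROM Customer_New WHERE CustomerClass = 'GOLD' AND Sector = 'PRIV';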
Hi –
Thanks for the post.
This is fine for retrieval SQL, but if the SUBSTRING() is part of a join condition, how will the statistics histogram behave?
SUBSTRING() consumes more CPU. What equivalent function would produce the same result with less CPU and I/O?