Teradata Statistics – How to Boost the Confidence
We have to be careful when applying SQL functions to WHERE condition columns. Often, they prevent the Optimizer from using the available statistics. Without collected statistics, estimations are based on random-AMP sampling (indexed columns) or heuristics (non-indexed columns). The downside of heuristics is described in this article: Teradata Heuristics.
Flawed physical data models are almost always the reason why functions on WHERE condition columns become necessary. We are living in a time of cost minimization, but keep the famous saying “bought cheaply is paid dearly” in mind.
In the example below, basic normalization rules have been violated: the column CustomerType contains two kinds of information and should be split into two columns.
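A sketch of the anti-pattern and its normalized alternative. The table layouts, column widths, and the names CustomerSegment and CustomerSector are assumptions for illustration; only CustomerType and the substring positions come from the article:

```sql
-- Denormalized: CustomerType packs two facts into one string,
-- e.g. 'GOLD' in positions 1-4 and 'PRIV' starting at position 10.
CREATE TABLE Customer_Denormalized (
    CustomerId   INTEGER NOT NULL,
    CustomerType CHAR(13)          -- two kinds of information in one column
) PRIMARY INDEX (CustomerId);

-- Normalized: one fact per column, each able to carry its own statistics.
CREATE TABLE Customer_Normalized (
    CustomerId      INTEGER NOT NULL,
    CustomerSegment CHAR(4),       -- hypothetical column, e.g. 'GOLD'
    CustomerSector  CHAR(4)        -- hypothetical column, e.g. 'PRIV'
) PRIMARY INDEX (CustomerId);
```

With the normalized design, the predicate becomes a plain equality on indexed or statistics-covered columns, and no string functions are needed at all.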
The Optimizer can't use the collected statistics on column CustomerType. It has to apply simple heuristic estimations.
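A query of this shape would trigger the heuristic path. This is a reconstruction from the execution plan quoted in the article, not the article's original statement:

```sql
-- Concatenating two SUBSTR results hides the column from the statistics:
SELECT *
FROM Customer
WHERE SUBSTR(CustomerType,1,4) || SUBSTR(CustomerType,10,4) = 'GOLDPRIV';
```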
If we assume that our Customer table contains 100,000 rows and we store 100 “Gold” customers in the “Private” sector, the resulting execution plan will look like this:
“We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with a condition of
(“((SUBSTR(Customer.CustomerType,1,4))||(SUBSTR(Customer.CustomerType,10,4))) = 'GOLDPRIV'”)
into Spool 1, which is built locally on the AMPs.
The size of Spool 1 is estimated with no confidence to be 10,000 rows.”
The Optimizer assumes that 10% of the table rows will be selected (heuristics). This estimation is completely wrong: in reality, only 100 rows are selected!
Our example is very simple: the execution plan contains only one retrieve step. The WHERE condition column is not indexed, and the estimation has no impact on the execution plan. Assuming that no alternative access paths (join indexes, secondary indexes) are available, the plan will always be the same: a full-table scan.
But execution plans are usually much more complex than in our simple example, and the above retrieve step would be just one step out of many. An entirely wrong estimation like the one in our example would most likely lead to a badly performing query.
How can we improve our example query? We have to improve the estimations for the retrieve step.
Below is an example of how the query could be rewritten. The optimized query allows the Optimizer to use the collected statistics on column CustomerType, and as a consequence, the retrieve step evaluates to “low confidence.” The estimated number of selected rows reflects reality:
SELECT * FROM Customer WHERE SUBSTR(CustomerType,1,4) = 'GOLD' AND SUBSTR(CustomerType,10,4) = 'PRIV';
“We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with a condition of (“((SUBSTR(Customer.CustomerType,1,4)) = 'GOLD')
AND ((SUBSTR(Customer.CustomerType,10,4)) = 'PRIV')”)
into Spool 1, which is built locally on the AMPs.
The size of Spool 1 is estimated with low confidence to be 100 rows.”
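For the rewritten predicate to be estimated with “low confidence” rather than “no confidence,” statistics must actually exist on the column. A minimal statement, assuming a current Teradata release:

```sql
-- Collect (or refresh) statistics on the column used in the predicate:
COLLECT STATISTICS COLUMN (CustomerType) ON Customer;
```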
There is a basic rule you should follow when designing queries that use complex WHERE condition expressions:
If WHERE condition column functions are nested, the SUBSTR and LIKE functions should always be applied first.
The reason: the Optimizer can use statistics on columns where only a part of a character string is queried – the way the statistics histograms store information makes this possible.
Unfortunately, statistics can't be used if a function such as TRIM is applied to the column first (with some exceptions, such as the functions UPPER and LOWER).
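A sketch of the rule, with TRIM used purely as an illustrative outer function: keep SUBSTR innermost, directly on the column, so the histogram prefix can still be matched:

```sql
-- Statistics usable: SUBSTR operates directly on the column.
SELECT * FROM Customer
WHERE TRIM(SUBSTR(CustomerType,1,4)) = 'GOLD';

-- Statistics not usable: TRIM is applied to the column first,
-- so the Optimizer no longer sees a plain prefix of CustomerType.
SELECT * FROM Customer
WHERE SUBSTR(TRIM(CustomerType),1,4) = 'GOLD';
```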
The best advice I can give: before you end up patching bad queries, try to fix your data model!