Be cautious when applying SQL functions to columns in WHERE conditions, as they can limit the Optimizer’s ability to use available statistics. Without collected statistics, estimates come from either random-AMP sampling (for indexed columns) or heuristics (for non-indexed columns). A notable disadvantage of heuristics is discussed in the following article: Teradata Heuristics.

Flawed physical data models often force functions onto columns in WHERE conditions. While minimizing modeling effort is tempting, remember the adage: buy cheap, pay twice.

The following example violates basic normalization rules:

The CustomerType column encodes two attributes in one value and should be split into two separate columns.

SELECT * FROM Customer WHERE SUBSTR(CustomerType,1,4) || SUBSTR(CustomerType,10,4) = 'GOLDPRIV';

The Optimizer cannot utilize the gathered statistics for the CustomerType column and must instead rely on basic heuristic approximations.

Assuming our Customer table has 100,000 rows, of which 100 are customers with a “Gold” status in the “Private” category, the execution plan looks like this:

“We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with a condition of
(“((SUBSTR(Customer.CustomerType,1,4))||(SUBSTR(Customer.CustomerType,10,4)))= 'GOLDPRIV'”)
into Spool 1, which is built locally on the AMPs.
The size of Spool 1 is estimated with no confidence to be 10,000 rows.”

Based on heuristics, the Optimizer assumes a 10% selection rate: 10% of 100,000 rows gives the estimate of 10,000 rows, while the query actually selects only 100 rows.

The example is simple: the execution plan includes just one retrieve step. The column in the WHERE condition is not indexed, and its estimation does not affect the execution plan. The plan will inevitably entail a full table scan if no alternative access paths (such as join indexes or secondary indexes) are present.
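The effect can be reproduced on any database that exposes its plans. The sketch below uses SQLite (via Python's standard library) rather than Teradata, and the table and index names are illustrative, but the behavior is analogous: wrapping a column in a function hides it from the index, forcing a scan.

```python
# Illustration only: SQLite, not Teradata, but the effect is analogous --
# a function applied to an indexed column blocks index use and forces a scan.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customer (CustomerId INTEGER, CustomerType TEXT)")
con.execute("CREATE INDEX idx_type ON Customer (CustomerType)")

def plan(sql):
    # EXPLAIN QUERY PLAN: the last column of each row describes the step
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# Function wrapped around the column: the index is unusable -> full-table SCAN
p_function = plan("SELECT * FROM Customer WHERE SUBSTR(CustomerType,1,4) = 'GOLD'")

# Plain predicate on the column: the index can serve it -> indexed SEARCH
p_plain = plan("SELECT * FROM Customer WHERE CustomerType = 'GOLDxxxxxPRIV'")

print(p_function)  # e.g. "SCAN Customer"
print(p_plain)     # e.g. "SEARCH Customer USING INDEX idx_type (CustomerType=?)"
```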

Real execution plans are usually far more intricate than this simple example; the retrieve would be just one of many steps, and a misestimate at this stage can push the Optimizer into a poor overall plan, with query performance suffering accordingly.

How can we enhance our query? We need to improve the estimations for retrieval.

Below is an example of how the query could be rewritten. The optimized query allows the Optimizer to use the collected statistics on column CustomerType, and as a consequence, the retrieving step evaluates to “low confidence.” The estimated number of selected rows reflects reality:

SELECT * FROM Customer WHERE SUBSTR(CustomerType,1,4) = 'GOLD' AND SUBSTR(CustomerType,10,4) = 'PRIV';

“We do an all-AMPs RETRIEVE step from Customer by way of an all-rows scan with a condition of (“((SUBSTR(Customer.CustomerType,1,4))= 'GOLD')
AND ((SUBSTR(Customer.CustomerType,10,4))= 'PRIV')”)
into Spool 1, which is built locally on the AMPs.
The size of Spool 1 is estimated with low confidence to be 100 rows.

When designing queries with complex WHERE conditions, adhering to a fundamental rule is crucial.

When functions on a WHERE-condition column must be nested, apply SUBSTR or LIKE directly to the column, before any other function.

The Optimizer can use column statistics even when only a leading part of the character string is queried, because the statistics histograms store the leading characters of each value.

Unfortunately, statistics can no longer be applied if the first function applied to the column is TRIM. There are exceptions, such as UPPER and LOWER, which still allow statistics to be used.
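The contrast can be sketched outside Teradata as well. In the SQLite example below (names illustrative; the `case_sensitive_like` pragma is a SQLite-specific requirement for its LIKE prefix optimization), a LIKE prefix applied directly to the column is rewritten as a range the index can serve, while wrapping the column in TRIM first destroys the prefix information — analogous to Teradata losing access to the leading characters stored in its statistics histograms.

```python
# Illustration only (SQLite, not Teradata): a leading-prefix predicate applied
# directly to the column can use an index; TRIM applied first prevents it.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Customer (CustomerId INTEGER, CustomerType TEXT)")
con.execute("CREATE INDEX idx_type ON Customer (CustomerType)")
# SQLite only rewrites LIKE 'prefix%' into an index range when this is on:
con.execute("PRAGMA case_sensitive_like = ON")

def plan(sql):
    return " ".join(row[-1] for row in con.execute("EXPLAIN QUERY PLAN " + sql))

# TRIM applied first: the prefix is no longer visible -> full-table SCAN
p_trim = plan("SELECT * FROM Customer WHERE TRIM(CustomerType) LIKE 'GOLD%'")

# LIKE applied directly to the column: rewritten as a range -> indexed SEARCH
p_like = plan("SELECT * FROM Customer WHERE CustomerType LIKE 'GOLD%'")

print(p_trim)
print(p_like)
```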

My most valuable advice is to prioritize fixing your data model before attempting to address problematic queries.
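A minimal sketch of that data-model fix, again using SQLite for illustration (the column names `CustomerStatus` and `CustomerClass` are my assumption of how CustomerType would be split): once the combined column is divided into two atomic columns, every predicate becomes a plain equality that statistics and indexes can serve directly.

```python
# Sketch of the normalized model: CustomerType split into two atomic columns
# (column names are illustrative, not from the original article).
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE Customer (
    CustomerId     INTEGER,
    CustomerStatus TEXT,   -- e.g. 'GOLD' (formerly characters 1-4)
    CustomerClass  TEXT    -- e.g. 'PRIV' (formerly characters 10-13)
)""")
con.execute("CREATE INDEX idx_status_class ON Customer (CustomerStatus, CustomerClass)")
con.execute("INSERT INTO Customer VALUES (1, 'GOLD', 'PRIV')")

# No functions needed in the WHERE condition any more:
rows = con.execute(
    "SELECT CustomerId FROM Customer "
    "WHERE CustomerStatus = 'GOLD' AND CustomerClass = 'PRIV'"
).fetchall()
print(rows)  # -> [(1,)]
```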

  • Hi –

    Thanks for the post.

    This works for retrieval SQL, but how do the statistics histograms behave if the SUBSTRING() is part of a join condition?

    Also, SUBSTRING() consumes more CPU. Is there an equivalent function that produces the same result with less CPU and I/O?
